Text Summarization

Text summarization produces a concise, coherent summary of a longer text while preserving its key information. It is a core tool for managing information overload.

Summarization Approaches

Approach	Method	Pros	Cons
Extractive	Select important sentences	Faithful to source	Less fluent
Abstractive	Generate new text	More natural	Can hallucinate
Hybrid	Extract then paraphrase	Balanced	Complex

Extractive Summarization

Extractive methods select the most important sentences from the source document.

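import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# sent_tokenize needs the Punkt sentence models; recent NLTK releases
# name the resource "punkt_tab", older ones "punkt"
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)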

def extractive_summary(document, num_sentences=3):
    sentences = nltk.sent_tokenize(document)

    # Create TF-IDF matrix
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Compute sentence scores (average TF-IDF)
    sentence_scores = np.array([
        tfidf_matrix[i].mean() for i in range(len(sentences))
    ])

    # Select top sentences
    top_indices = np.argsort(sentence_scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Maintain original order

    summary = ' '.join([sentences[i] for i in top_indices])
    return summary

# Example
doc = """
Natural language processing is a subfield of AI. It focuses on the interaction
between computers and human language. NLP combines linguistics and computer science.
Recent advances in deep learning have revolutionized NLP. Transformers are the
state-of-the-art architecture for most NLP tasks.
"""
print(extractive_summary(doc, num_sentences=2))

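Abstractive Summarization with BART

Abstractive methods generate new text conditioned on the source rather than copying sentences from it. The example below uses facebook/bart-large-cnn, a BART checkpoint fine-tuned on CNN/DailyMail news articles.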

from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

def summarize(text, max_length=150, min_length=40):
    inputs = tokenizer(
        text,
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # needed once inputs are padded in a batch
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        no_repeat_ngram_size=3,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

text = """
The transformer architecture has revolutionized natural language processing.
Introduced in 2017, it replaced recurrent neural networks with self-attention.
Transformers enable parallel processing of sequences, dramatically speeding up training.
The architecture has spawned models like BERT, GPT, and T5. These models achieve
state-of-the-art results on virtually every NLP benchmark.
"""
print(summarize(text))

ROUGE Evaluation Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between a generated summary and a reference summary, counted over n-grams (ROUGE-N) or over the longest common subsequence (ROUGE-L).

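ROUGE-N

$$\text{ROUGE-N} = \frac{\sum_{g \in G_n(X)} \text{Count}_{\text{match}}(g)}{\sum_{g \in G_n(X)} \text{Count}(g)}$$

Here G_n(X) is the set of n-grams in the reference, and Count_match(g) is the number of occurrences of g in X that also appear in Y.

ROUGE-L (F1)

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \quad P_{lcs} = \frac{LCS(X, Y)}{n}, \quad F_1 = \frac{2 R_{lcs} P_{lcs}}{R_{lcs} + P_{lcs}}$$

ROUGE-L Score

$$F_{lcs} = \frac{(1 + \beta^2) R_{lcs} P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$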

Where:

X: Reference summary (length m)
Y: Generated summary (length n)
LCS: Longest Common Subsequence
β: F-measure weight; β > 1 favors recall (some implementations use β = 1.2, while β = 1 gives the balanced F1 that the rouge_score package reports)
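
Using the symbols defined above (X, Y, LCS, β), ROUGE-L can be sketched in plain Python. This is a minimal illustration, not the rouge_score implementation: it assumes lowercased whitespace tokens with no stemming, and the function names lcs_length and rouge_l are made up for this example.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the x tokens processed so far and y[:j]
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, generated, beta=1.0):
    """Return (precision, recall, F_beta) for ROUGE-L.

    Assumption: tokens are lowercased whitespace-separated words.
    """
    x = reference.lower().split()   # reference X, length m
    y = generated.lower().split()   # generated Y, length n
    if not x or not y:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(x)
    precision = lcs / len(y)
    f_beta = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return precision, recall, f_beta


p, r, f = rouge_l("the cat sat on the mat", "the cat lay on a mat")
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
# P=0.667 R=0.667 F=0.667  (LCS is "the cat on mat", 4 of 6 tokens)
```

Raising beta above 1 pulls the score toward recall, which matters when the generated summary is much shorter than the reference.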

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL'],
    use_stemmer=True
)

reference = "The transformer architecture revolutionized NLP with self-attention."
generated = "Transformers changed NLP by introducing self-attention mechanisms."

scores = scorer.score(reference, generated)

for metric, score in scores.items():
    print(f"{metric}: P={score.precision:.3f} R={score.recall:.3f} F={score.fmeasure:.3f}")
# After tokenization and stemming each summary has 8 tokens, 4 of them shared:
# rouge1: P=0.500 R=0.500 F=0.500
# rouge2: P=0.143 R=0.143 F=0.143
# rougeL: P=0.500 R=0.500 F=0.500

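ROUGE Scores Comparison

Scores are ROUGE F1 on a 0-100 scale; the values appear to be in line with results commonly reported on the CNN/DailyMail test set.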

Model	ROUGE-1	ROUGE-2	ROUGE-L
Lead-3 (baseline)	40.1	17.6	36.6
BERTSum	43.2	20.1	39.6
BART	44.2	21.3	40.9
PEGASUS	44.2	21.5	41.1
T5	43.5	20.8	40.2
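
The Lead-3 baseline in the table simply takes the first three sentences of the document as the summary, which works well on news articles because they tend to front-load the key facts. A minimal sketch follows; the function name lead_k and the regex sentence splitter are assumptions made for illustration, and a real pipeline would likely use a proper sentence tokenizer.

```python
import re


def lead_k(document, k=3):
    """Return the first k sentences of a document as its summary."""
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', document.strip())
    return ' '.join(sentences[:k])


article = (
    "Storms hit the coast on Monday. Thousands lost power. "
    "Crews restored service by Tuesday. Schools reopened on Wednesday."
)
print(lead_k(article))
# Storms hit the coast on Monday. Thousands lost power. Crews restored service by Tuesday.
```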

PEGASUS for Summarization

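PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is pre-trained with an objective designed for summarization: important sentences are removed from the input document, and the model learns to generate them from the remaining text.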

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

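# pegasus-large is the pre-trained checkpoint; fine-tuned variants such as
# google/pegasus-cnn_dailymail or google/pegasus-xsum usually summarize better out of the box
model_name = "google/pegasus-large"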
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def pegasus_summarize(text, max_length=128):
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    summary = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=8,
        length_penalty=0.6
    )
    return tokenizer.decode(summary[0], skip_special_tokens=True)
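
The "extracted gap-sentences" in the PEGASUS name can be illustrated with a toy example builder. The sketch below assumes one selection strategy: score each sentence by its unigram F1 against the rest of the document and mask the top scorers. The function names, the regex tokenizer, and the "<mask_1>" placeholder string are all assumptions made for illustration, not the actual pre-training code.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric runs are good enough as tokens here
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_f1(candidate, reference):
    """F1 of clipped unigram overlap between two token lists."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def make_gap_sentence_example(sentences, num_gaps=1, mask_token="<mask_1>"):
    """Mask the sentences most similar to the rest of the document.

    Returns (source, target): the document with gaps, and the removed
    sentences that the model would be trained to generate.
    """
    tokenized = [tokenize(s) for s in sentences]
    scores = []
    for i, tokens in enumerate(tokenized):
        rest = [t for j, other in enumerate(tokenized) if j != i for t in other]
        scores.append(unigram_f1(tokens, rest))
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    gaps = set(ranked[:num_gaps])
    source = ' '.join(mask_token if i in gaps else s for i, s in enumerate(sentences))
    target = ' '.join(s for i, s in enumerate(sentences) if i in gaps)
    return source, target


sentences = [
    "The storm flooded the coastal town.",
    "Residents fled the flooded town before the storm peaked.",
    "Most residents returned after two days.",
]
source, target = make_gap_sentence_example(sentences)
print(source)  # The storm flooded the coastal town. <mask_1> Most residents returned after two days.
print(target)  # Residents fled the flooded town before the storm peaked.
```

The second sentence is selected because it shares the most words with the other two, which makes it the closest thing to a summary of the document.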

Training Objective

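Summarization Loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log P_\theta(y_t \mid y_{<t}, x)$$

Where x is the source document, y is the target summary, and y_t is the t-th token of y.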
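
As a rough numerical illustration, the sketch below assumes the training objective is the standard token-level negative log-likelihood of the target summary given the source. The probabilities are made-up toy values standing in for a decoder's output at each step, and summarization_nll is a hypothetical helper name.

```python
import math


def summarization_nll(step_probs, target_ids):
    """Sum of -log P(y_t | y_<t, x) over the target tokens.

    step_probs[t] is the (assumed) predicted distribution over the
    vocabulary at decoding step t; target_ids[t] is the gold token y_t.
    """
    loss = 0.0
    for probs, token_id in zip(step_probs, target_ids):
        loss -= math.log(probs[token_id])
    return loss


# Toy vocabulary of 4 tokens, target summary of 3 tokens
step_probs = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
target_ids = [0, 1, 3]

loss = summarization_nll(step_probs, target_ids)
print(f"total NLL = {loss:.3f}, per-token = {loss / len(target_ids):.3f}")
# total NLL = 1.406, per-token = 0.469
```

The loss falls as the model assigns more probability to each reference token, and it reaches zero only when every target token is predicted with probability 1.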

ROUGE-1 Calculation

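ROUGE-1 measures unigram overlap and mainly reflects content coverage (adequacy). ROUGE-2 measures bigram overlap, so it is more sensitive to local word order and fluency. ROUGE-L uses the longest common subsequence to capture sentence-level structure without requiring matches to be consecutive.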
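
A worked calculation makes the difference between ROUGE-1 and ROUGE-2 concrete. The sketch below counts clipped n-gram matches by hand; it assumes lowercased whitespace tokens with no stemming, so its numbers can differ from those of the rouge_score package, and the helper names are chosen for this example only.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(reference, generated, n=1):
    """Return (precision, recall, F1) for ROUGE-N with clipped counts."""
    ref = Counter(ngrams(reference.lower().split(), n))
    gen = Counter(ngrams(generated.lower().split(), n))
    # Counter intersection keeps the minimum count per n-gram (clipping)
    overlap = sum((ref & gen).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(gen.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the cat sat on the mat"
generated = "the cat is on the mat"

for n in (1, 2):
    p, r, f = rouge_n(reference, generated, n)
    print(f"ROUGE-{n}: P={p:.3f} R={r:.3f} F={f:.3f}")
# ROUGE-1: P=0.833 R=0.833 F=0.833  (5 of 6 unigrams match)
# ROUGE-2: P=0.600 R=0.600 F=0.600  (3 of 5 bigrams match)
```

A single substituted word costs one unigram but breaks two bigrams, which is why ROUGE-2 drops faster than ROUGE-1.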
