Text Summarization

Text summarization produces a concise, coherent version of a longer text while preserving its key information. It is a practical way to cope with information overload.

Summarization Approaches

| Approach | Method | Pros | Cons |
|---|---|---|---|
| Extractive | Select important sentences | Faithful to source | Less fluent |
| Abstractive | Generate new text | More natural | Can hallucinate |
| Hybrid | Extract then paraphrase | Balanced | Complex |

Extractive Summarization

Extractive methods select the most important sentences from the source document.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

# sent_tokenize needs the Punkt models; newer NLTK releases look for "punkt_tab".
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def extractive_summary(document, num_sentences=3):
    # Split the document into sentences
    sentences = nltk.sent_tokenize(document)

    # Create TF-IDF matrix (one row per sentence)
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

    # Compute sentence scores (average TF-IDF weight across the vocabulary)
    sentence_scores = np.asarray(tfidf_matrix.mean(axis=1)).ravel()

    # Select top sentences
    top_indices = np.argsort(sentence_scores)[-num_sentences:]
    top_indices = sorted(top_indices)  # Maintain original order

    summary = ' '.join([sentences[i] for i in top_indices])
    return summary

# Example
doc = """
Natural language processing is a subfield of AI. It focuses on the interaction
between computers and human language. NLP combines linguistics and computer science.
Recent advances in deep learning have revolutionized NLP. Transformers are the
state-of-the-art architecture for most NLP tasks.
"""
print(extractive_summary(doc, num_sentences=2))

Abstractive Summarization with BART

from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

def summarize(text, max_length=150, min_length=40):
    inputs = tokenizer(
        text,
        max_length=1024,
        truncation=True,
        return_tensors="pt"
    )

    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],  # needed once inputs are padded or batched
        max_length=max_length,
        min_length=min_length,
        length_penalty=2.0,
        num_beams=4,
        no_repeat_ngram_size=3,
        early_stopping=True
    )

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

text = """
The transformer architecture has revolutionized natural language processing.
Introduced in 2017, it replaced recurrent neural networks with self-attention.
Transformers enable parallel processing of sequences, dramatically speeding up training.
The architecture has spawned models like BERT, GPT, and T5. These models achieve
state-of-the-art results on virtually every NLP benchmark.
"""
print(summarize(text))

ROUGE Evaluation Metrics

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between generated and reference summaries: n-gram overlap for ROUGE-N and the longest common subsequence for ROUGE-L.

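ROUGE-N

$$\text{ROUGE-N} = \frac{\sum_{g \in G_n(X)} \min\big(\text{Count}_X(g),\ \text{Count}_Y(g)\big)}{\sum_{g \in G_n(X)} \text{Count}_X(g)}$$

ROUGE-L (F1)

$$R_{lcs} = \frac{LCS(X, Y)}{m}, \qquad P_{lcs} = \frac{LCS(X, Y)}{n}, \qquad F_1 = \frac{2\, R_{lcs}\, P_{lcs}}{R_{lcs} + P_{lcs}}$$

ROUGE-L Score

$$F_{lcs} = \frac{(1 + \beta^2)\, R_{lcs}\, P_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

Where:

  • X: Reference summary (length m)
  • Y: Generated summary (length n)
  • G_n(X): The set of n-grams in X; Count_X(g) is the number of times g occurs in X
  • LCS: Longest Common Subsequence
  • β: F-measure weight (typically β = 1.2 for recall emphasis; β = 1 gives the F1 form)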
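
The ROUGE-L quantities defined above can be computed directly from X, Y, and their LCS. The following is a minimal sketch, assuming lowercase whitespace tokenization and no stemming; the helper names `lcs_length` and `rouge_l` are illustrative and are not part of any library. Recall is taken as LCS / m, precision as LCS / n, and the two are combined with the weight β.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    # prev[j] is the LCS length of the x-prefix processed so far and y[:j].
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, generated, beta=1.2):
    """Return (precision, recall, F_beta) for ROUGE-L on whitespace tokens."""
    x = reference.lower().split()  # X, length m
    y = generated.lower().split()  # Y, length n
    lcs = lcs_length(x, y)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(x)
    precision = lcs / len(y)
    f_beta = (1 + beta**2) * precision * recall / (recall + beta**2 * precision)
    return precision, recall, f_beta


p, r, f = rouge_l("the cat sat on the mat", "the cat lay on a mat")
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
# P=0.667 R=0.667 F=0.667
```

Here the LCS is "the cat on mat" (4 tokens) against two 6-token summaries, so precision and recall are both 4/6 and the choice of β makes no difference. β only matters when precision and recall differ.
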
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ['rouge1', 'rouge2', 'rougeL'],
    use_stemmer=True
)

reference = "The transformer architecture revolutionized NLP with self-attention."
generated = "Transformers changed NLP by introducing self-attention mechanisms."

scores = scorer.score(reference, generated)

for metric, score in scores.items():
    print(f"{metric}: P={score.precision:.3f} R={score.recall:.3f} F={score.fmeasure:.3f}")
# rouge1: P=0.500 R=0.500 F=0.500
# rouge2: P=0.143 R=0.143 F=0.143
# rougeL: P=0.500 R=0.500 F=0.500

ROUGE Scores Comparison

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Lead-3 (baseline) | 40.1 | 17.6 | 36.6 |
| BERTSum | 43.2 | 20.1 | 39.6 |
| BART | 44.2 | 21.3 | 40.9 |
| PEGASUS | 44.2 | 21.5 | 41.1 |
| T5 | 43.5 | 20.8 | 40.2 |
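
The Lead-3 baseline in the table is commonly defined as "take the first three sentences of the document". A minimal sketch follows; the function name `lead_k` and the naive regex sentence splitter are assumptions made for illustration (a real pipeline would more likely use `nltk.sent_tokenize`).

```python
import re


def lead_k(document, k=3):
    """Return the first k sentences of a document as its summary."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(sentences[:k])


doc = (
    "NLP is a subfield of AI. It studies human language. "
    "Deep learning changed the field. Transformers now dominate."
)
print(lead_k(doc))
# NLP is a subfield of AI. It studies human language. Deep learning changed the field.
```

The baseline needs no training, which is why learned models are usually reported next to it.
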

PEGASUS for Summarization

PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization) is specifically pre-trained for summarization.
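
The "gap-sentences" idea can be illustrated with a toy example: important sentences are removed from the input and the model is trained to generate them. The sketch below is an assumption-laden illustration, not the actual pre-training pipeline. It scores each sentence by unigram F1 against the rest of the document, replaces the top-scoring one with a placeholder, and returns the (source, target) pair. The names `unigram_f1` and `gap_sentence_example` and the `MASK` string are hypothetical.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # placeholder string chosen for this sketch


def unigram_f1(a, b):
    """Unigram F1 overlap between two token lists."""
    overlap = sum((Counter(a) & Counter(b)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(a), overlap / len(b)
    return 2 * p * r / (p + r)


def gap_sentence_example(document, num_gaps=1):
    """Build one (masked source, target) training pair from a document."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    tokens = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]

    # Score each sentence against all remaining sentences.
    scores = []
    for i, sent in enumerate(tokens):
        rest = [t for j, other in enumerate(tokens) if j != i for t in other]
        scores.append(unigram_f1(sent, rest))

    # Mask the highest-scoring sentences; they become the generation target.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    gaps = set(ranked[:num_gaps])
    source = " ".join(MASK if i in gaps else s for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(gaps))
    return source, target


doc = (
    "PEGASUS masks whole sentences during pre-training. "
    "The model learns to generate the masked sentences. "
    "The model is then fine-tuned on summarization data."
)
source, target = gap_sentence_example(doc)
print(source)
print(target)
```

Because the target is a whole salient sentence rather than a few random tokens, the pre-training task resembles summarization itself.
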

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

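# "google/pegasus-large" is the pre-trained checkpoint; a fine-tuned one such as
# "google/pegasus-xsum" usually gives better summaries out of the box.
model_name = "google/pegasus-large"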
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def pegasus_summarize(text, max_length=128):
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    summary = model.generate(
        **inputs,
        max_length=max_length,
        num_beams=8,
        length_penalty=0.6
    )
    return tokenizer.decode(summary[0], skip_special_tokens=True)

Training Objective

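Summarization Loss

$$\mathcal{L}(\theta) = -\sum_{t=1}^{|y|} \log P_\theta\left(y_t \mid y_{<t},\, x\right)$$

Where x is the source document, y is the target summary, y_t is its t-th token, and θ denotes the model parameters.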
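
Assuming the usual teacher-forced cross-entropy objective for sequence-to-sequence models, the loss is the sum of negative log-probabilities the model assigns to each correct target token, given the source x and the preceding target tokens. The sketch below uses made-up per-token probabilities; `sequence_nll` is an illustrative helper, not a library function.

```python
import math


def sequence_nll(step_probs):
    """Negative log-likelihood of a target summary.

    step_probs[t] is the probability the model assigned to the correct
    target token y_t, given the source x and the previous tokens y_<t.
    """
    return -sum(math.log(p) for p in step_probs)


# Hypothetical probabilities for a 4-token target summary.
probs = [0.9, 0.6, 0.8, 0.5]
loss = sequence_nll(probs)
print(f"total NLL = {loss:.4f}, per token = {loss / len(probs):.4f}")
```

A model that assigns probability 1 to every target token has zero loss, and low-probability tokens dominate the sum.
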

ROUGE-1 Calculation

ROUGE-1 measures unigram overlap and mainly reflects content coverage. ROUGE-2 measures bigram overlap, so it is more sensitive to local word order and phrasing. ROUGE-L uses the longest common subsequence to capture sentence-level structure.
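
A ROUGE-1 calculation by hand amounts to counting clipped n-gram matches. The sketch below assumes lowercase whitespace tokenization without stemming, so its numbers can differ from those of the `rouge_score` package; `ngrams` and `rouge_n` are illustrative helper names.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, generated, n=1):
    """Return (precision, recall, F1) for ROUGE-N."""
    ref = ngrams(reference.lower().split(), n)
    gen = ngrams(generated.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((ref & gen).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(gen.values()) if gen else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = "the cat sat on the mat"
generated = "the cat sat down"
for n in (1, 2):
    p, r, f = rouge_n(reference, generated, n)
    print(f"rouge-{n}: P={p:.3f} R={r:.3f} F={f:.3f}")
# rouge-1: P=0.750 R=0.500 F=0.600
# rouge-2: P=0.667 R=0.400 F=0.500
```

For ROUGE-1, three of the four generated unigrams ("the", "cat", "sat") appear in the six-token reference, giving precision 3/4 and recall 3/6.
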

