Sequence-to-Sequence Models


Sequence-to-sequence (seq2seq) models map input sequences to output sequences of potentially different lengths. They're foundational for machine translation, text summarization, dialogue systems, and code generation.

How Seq2Seq Works

  1. The encoder reads the input token by token and produces one hidden state per token
  2. The context vector (the final encoder hidden state) summarizes the whole input
  3. The decoder generates output tokens conditioned on the context vector
  4. Each decoder step takes the previously generated token as its input for the next step

Encoder Hidden State

$$h_t = f(x_t, h_{t-1})$$

Decoder Generation

$$s_t = g(y_{t-1}, s_{t-1}, c)$$
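As a rough illustration of the two recurrences above, the sketch below uses scalar states and a tanh update with hand-picked weights. The function names `encode` and `decode_step` and every weight value are made up for this example; a real model would use learned matrices and an LSTM or GRU cell.

```python
import math


def encode(inputs, w_x=0.5, w_h=0.8):
    """Toy encoder: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


def decode_step(prev_y, prev_s, c, w_y=0.3, w_s=0.6, w_c=0.9):
    """Toy decoder: s_t = tanh(w_y * y_{t-1} + w_s * s_{t-1} + w_c * c)."""
    return math.tanh(w_y * prev_y + w_s * prev_s + w_c * c)


encoder_states = encode([1.0, -2.0, 0.5])  # three input steps
context = encoder_states[-1]               # c = final encoder hidden state

# Generate five output steps from three inputs: lengths need not match.
s, y = context, 0.0                        # y_0 stands in for the <sos> token
decoder_states = []
for _ in range(5):
    s = decode_step(y, s, context)
    y = s                                  # toy "output" fed back as next input
    decoder_states.append(s)
```

Note that the decoder only ever sees the input through `context`, which is the bottleneck that attention later addresses.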

Teacher Forcing

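During training, teacher forcing feeds the decoder the ground-truth previous token instead of its own prediction. The implementation below applies it with probability teacher_forcing_ratio at each step and otherwise feeds back the decoder's own top prediction.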

import torch
import torch.nn as nn


class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # src: (batch, src_len), trg: (batch, trg_len), both token indices
        batch_size = src.shape[0]
        trg_len = trg.shape[1]
        # The decoder is expected to expose its vocabulary size as output_dim
        trg_vocab_size = self.decoder.output_dim

        # Logits for every target position; position 0 (<sos>) stays zero
        outputs = torch.zeros(batch_size, trg_len, trg_vocab_size, device=self.device)

        # Encode
        hidden, cell = self.encoder(src)

        # First decoder input is <sos> token
        dec_input = trg[:, 0]

        for t in range(1, trg_len):
            output, hidden, cell = self.decoder(dec_input, hidden, cell)
            outputs[:, t] = output

            # Teacher forcing: one coin flip per step, shared by the whole batch
            teacher_force = torch.rand(1).item() < teacher_forcing_ratio
            top1 = output.argmax(1)
            dec_input = trg[:, t] if teacher_force else top1

        return outputs
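The per-step choice inside that loop can be isolated in plain Python. This is only a sketch: `choose_next_input` and `simulate` are hypothetical helper names, and the token lists are invented. It shows the two extremes of the ratio: 1.0 always feeds the ground truth, 0.0 always feeds the model's own prediction.

```python
import random


def choose_next_input(ground_truth, prediction, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at the next step."""
    # rng.random() is in [0, 1), so ratio 1.0 always forces and 0.0 never does
    if rng.random() < teacher_forcing_ratio:
        return ground_truth
    return prediction


def simulate(target, predictions, ratio, seed=0):
    """Apply the choice at every step of one (made-up) sequence."""
    rng = random.Random(seed)
    return [choose_next_input(t, p, ratio, rng)
            for t, p in zip(target, predictions)]


target = [11, 12, 13, 14]        # ground-truth token ids (invented)
predictions = [11, 99, 13, 98]   # what the decoder predicted (invented)

always_forced = simulate(target, predictions, ratio=1.0)
never_forced = simulate(target, predictions, ratio=0.0)
```

Intermediate ratios mix the two, so the decoder is sometimes exposed to its own mistakes during training, as it will be at inference time.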

Beam Search Decoding

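Beam search keeps the beam_width highest-scoring partial sequences at each step instead of committing to the single most likely token. It is not guaranteed to find the best sequence, but it usually finds a higher-probability output than greedy decoding.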

@torch.no_grad()
def beam_search(model, src, beam_width=5, max_len=50, sos_idx=1, eos_idx=2):
    # Assumes a single source sequence, src: (1, src_len), and model.eval()
    hidden, cell = model.encoder(src)

    # Initialize beams: (log_probability, token_sequence, hidden, cell)
    beams = [(0.0, [sos_idx], hidden, cell)]
    completed = []

    for _ in range(max_len):
        all_candidates = []
        for score, seq, h, c in beams:
            if seq[-1] == eos_idx:
                completed.append((score, seq))
                continue

            last_token = torch.tensor([seq[-1]], device=src.device)
            output, new_h, new_c = model.decoder(last_token, h, c)
            # output: (1, vocab_size) logits, so normalize over the vocabulary
            log_probs = torch.log_softmax(output, dim=-1)

            topk = torch.topk(log_probs, beam_width)
            for i in range(beam_width):
                token = topk.indices[0][i].item()
                new_score = score + topk.values[0][i].item()
                all_candidates.append((new_score, seq + [token], new_h, new_c))

        # Every beam has emitted <eos>, so there is nothing left to expand
        if not all_candidates:
            beams = []
            break

        # Keep top beam_width candidates
        beams = sorted(all_candidates, key=lambda x: x[0], reverse=True)[:beam_width]

    # Beams still open when max_len is reached also compete
    completed.extend([(score, seq) for score, seq, _, _ in beams])
    # Length-normalize so short sequences are not favored just for being short
    return max(completed, key=lambda x: x[0] / len(x[1]))
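To see why a wider beam can beat greedy decoding, here is a small self-contained sketch that runs the same idea over a hand-written next-token table instead of a neural decoder. `TABLE`, `next_log_probs`, and `toy_beam_search` are invented for this example, and unlike the function above it moves a hypothesis to `completed` as soon as it emits `<eos>`.

```python
import math

# Hypothetical next-token probabilities, conditioned only on the last token.
TABLE = {
    "<sos>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.3, "b": 0.3, "<eos>": 0.4},
    "b": {"a": 0.05, "b": 0.05, "<eos>": 0.9},
}


def next_log_probs(token):
    return {tok: math.log(p) for tok, p in TABLE[token].items()}


def toy_beam_search(beam_width, max_len=5, sos="<sos>", eos="<eos>"):
    beams = [(0.0, [sos])]          # (log probability, token sequence)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            for tok, lp in next_log_probs(seq[-1]).items():
                candidates.append((score + lp, seq + [tok]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for score, seq in candidates[:beam_width]:
            # Finished hypotheses leave the beam; the rest are expanded again
            (completed if seq[-1] == eos else beams).append((score, seq))
        if not beams:
            break
    completed.extend(beams)
    # Same length normalization as above: log probability per token
    return max(completed, key=lambda c: c[0] / len(c[1]))


greedy_score, greedy_seq = toy_beam_search(beam_width=1)
beam_score, beam_seq = toy_beam_search(beam_width=2)
```

With `beam_width=1` the search commits to `a` (probability 0.6) and ends with total probability 0.6 × 0.4 = 0.24. With `beam_width=2` it also keeps `b`, whose continuation to `<eos>` gives 0.4 × 0.9 = 0.36.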

Encoder-Decoder Implementation

class Encoder(nn.Module):
    def __init__(self, input_dim, embed_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, n_layers,
                           batch_first=True, bidirectional=True)
        self.fc_hidden = nn.Linear(hidden_dim * 2, hidden_dim)
        self.fc_cell = nn.Linear(hidden_dim * 2, hidden_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src):
        # src: (batch, src_len) token indices
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.rnn(embedded)

        # hidden/cell: (n_layers * 2, batch, hidden_dim), ordered
        # [layer 0 fwd, layer 0 bwd, layer 1 fwd, ...]. Concatenate the two
        # directions per layer to get (n_layers, batch, hidden_dim * 2),
        # then project to the (n_layers, batch, hidden_dim) state shape
        # the decoder LSTM expects.
        hidden = torch.cat((hidden[0::2], hidden[1::2]), dim=2)
        cell = torch.cat((cell[0::2], cell[1::2]), dim=2)
        hidden = torch.tanh(self.fc_hidden(hidden))
        cell = torch.tanh(self.fc_cell(cell))

        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, output_dim, embed_dim, hidden_dim, n_layers, dropout):
        super().__init__()
        # Stored so Seq2Seq can size its output tensor
        self.output_dim = output_dim
        self.embedding = nn.Embedding(output_dim, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, n_layers, batch_first=True)
        self.fc_out = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input, hidden, cell):
        # input: (batch,) previous tokens -> (batch, 1) for a one-step sequence
        input = input.unsqueeze(1)
        embedded = self.dropout(self.embedding(input))
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        # prediction: (batch, output_dim) unnormalized logits
        prediction = self.fc_out(output.squeeze(1))
        return prediction, hidden, cell

Seq2Seq Applications

| Application | Input | Output | Key Challenge |
|---|---|---|---|
| Machine Translation | English text | French text | Word order differences |
| Text Summarization | Long document | Short summary | Maintaining key information |
| Dialogue Systems | User query | System response | Coherence, context |
| Code Generation | Natural language | Source code | Syntax correctness |
| Image Captioning | Image features | Description text | Multimodal alignment |
| Speech Recognition | Audio features | Transcript | Real-time processing |

Attention Mechanism

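Compressing the whole input into a single fixed-size context vector is a bottleneck, especially for long inputs. Attention lets the decoder compute a fresh context vector at each step as a weighted sum of all encoder hidden states, so it can focus on different parts of the input as it generates.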

Attention Score

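$$\alpha_{t,s} = \frac{\exp(e_{t,s})}{\sum_{k=1}^{T_x} \exp(e_{t,k})}$$

Here e_{t,s} is the unnormalized alignment score between the decoder state at step t and encoder hidden state h_s, and T_x is the input length.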

Context Vector with Attention

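$$c_t = \sum_{s=1}^{T_x} \alpha_{t,s} h_s$$

class Attention(nn.Module):
    def __init__(self, enc_dim, dec_dim):
        super().__init__()
        self.attn = nn.Linear(enc_dim + dec_dim, dec_dim)
        self.v = nn.Linear(dec_dim, 1)

    def forward(self, hidden, encoder_outputs):
        # hidden: (batch, dec_dim), encoder_outputs: (batch, src_len, enc_dim).
        # The Encoder above discards its per-token outputs; it would need to
        # return them as well to be used with this module.
        src_len = encoder_outputs.shape[1]
        # Repeat the decoder state once per source position
        hidden = hidden.unsqueeze(1).repeat(1, src_len, 1)
        # energy: (batch, src_len, dec_dim)
        energy = torch.tanh(self.attn(torch.cat([hidden, encoder_outputs], dim=2)))
        # One unnormalized score e_{t,s} per source position: (batch, src_len)
        attention = self.v(energy).squeeze(2)
        # Softmax over source positions gives the weights alpha_{t,s}
        return torch.softmax(attention, dim=1)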
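The two attention formulas can be checked by hand with a few numbers. The sketch below is plain Python with invented scores and two-dimensional encoder states; `attention_weights` and `context_vector` are illustrative names, not part of the model above.

```python
import math


def attention_weights(scores):
    """Softmax over the alignment scores e_{t,s} for one decoder step t."""
    m = max(scores)                        # subtract max for numerical stability
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]


def context_vector(weights, encoder_states):
    """c_t = sum over s of alpha_{t,s} * h_s, each h_s a list of floats."""
    dim = len(encoder_states[0])
    return [sum(w * h[d] for w, h in zip(weights, encoder_states))
            for d in range(dim)]


scores = [1.0, 2.0, 3.0]                   # hypothetical e_{t,s} values
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha = attention_weights(scores)
c_t = context_vector(alpha, states)
```

The weights sum to 1 and grow with the score, so `c_t` is pulled toward the third encoder state here. Equal scores would reduce the context vector to a plain average of the encoder states.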

Evaluation Metrics

BLEU Score

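$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

Here p_n is the clipped n-gram precision, w_n is the weight for n-gram order n (typically 1/N), and BP is the brevity penalty for candidates shorter than the reference.

| Metric | Range | Best For |
|---|---|---|
| BLEU | 0-1 | Machine translation |
| ROUGE | 0-1 | Summarization |
| METEOR | 0-1 | Translation quality |
| CIDEr | 0-10 (CIDEr-D) | Image captioning |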
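To make the BLEU formula concrete, the following sketch computes a sentence-level score against a single reference with uniform weights w_n = 1/N. It is a simplification under stated assumptions: no smoothing, one reference, whitespace tokenization, and the helper names are invented here. Established implementations (for example in NLTK or sacreBLEU) handle multiple references, smoothing, and corpus-level aggregation.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n against one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram counts at most as often as it occurs in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=4):
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0                         # log(0) is undefined without smoothing
    log_avg = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)   # brevity penalty
    return bp * math.exp(log_avg)


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
bleu2 = sentence_bleu(candidate, reference, max_n=2)
```

For this pair, 5 of 6 unigrams and 3 of 5 bigrams match, the lengths are equal so BP = 1, and the score is the geometric mean of 5/6 and 3/5, which is sqrt(0.5) ≈ 0.707.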
⭐

Premium Content

Sequence-to-Sequence Models

Unlock this lesson and 900+ advanced tutorials with a Premium plan.

🎯End-to-end Projects
πŸ’ΌInterview Prep
πŸ“œCertificates
🀝Community Access

Already a member? Log in

Need Expert NLP Help?

Get personalized tutoring, project support, or professional consulting.

Advertisement