Welcome back, future AI engineer! In our previous chapters, we mastered the fundamentals of deep learning with feedforward neural networks (FNNs). We learned how these networks excel at tasks where inputs are independent and fixed in size, like classifying images or predicting a single value from a structured dataset.

But what happens when the order of your data matters? What if your input isn’t a single, fixed-size vector, but a sequence of varying length, where each element’s meaning is influenced by what came before it? Think about natural language, where the meaning of a word depends on the preceding words, or time series data, where future values are influenced by past observations. Traditional FNNs hit a wall here because they lack “memory” and treat each input independently.

This chapter introduces you to the fascinating world of Recurrent Neural Networks (RNNs). You’ll discover how RNNs are designed to process sequential data by maintaining an internal “state” or “memory” that captures information from previous steps in the sequence. We’ll start with the basic RNN architecture, understand its limitations like vanishing gradients, and then explore more sophisticated variants like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) that overcome these challenges. By the end, you’ll be able to implement these powerful models using PyTorch, preparing you for real-world tasks in areas like natural language processing, speech recognition, and time series forecasting.

Ready to add the power of memory to your neural networks? Let’s dive in!

The Challenge of Sequence Data

Imagine you’re trying to predict the next word in a sentence: “The cat sat on the ___.” To fill in that blank, you need to remember “cat” and “sat” to infer that “mat” or “rug” are likely candidates. A standard FNN, which takes a fixed-size input and produces a fixed-size output, struggles with this. It doesn’t inherently understand the concept of order or carry information from one input to the next.

Here’s why FNNs fall short with sequences:

  • Fixed Input Size: FNNs require a predefined number of input features. How would you represent a sentence that could be 3 words long or 30 words long? Padding shorter sequences or truncating longer ones can lose valuable information.
  • No Memory: Each input to an FNN is processed independently. There’s no mechanism for the network to remember previous inputs in a sequence, which is crucial for understanding context.
  • Lack of Parameter Sharing: If you tried to process each word in a sentence with a separate FNN, you’d end up with a huge number of parameters, and the network wouldn’t generalize well across different positions in a sequence.

Introducing Recurrent Neural Networks (RNNs)

RNNs solve these problems by introducing a “loop” or “recurrence” in their architecture. This loop allows information to persist from one step of the sequence to the next. Think of it as the network having a short-term memory.

How a Basic RNN Works

At its core, an RNN processes a sequence element by element, and at each step, it takes two inputs:

  1. The current element from the input sequence.
  2. The “hidden state” from the previous time step.

It then produces two outputs:

  1. An output for the current time step (e.g., a prediction).
  2. An updated hidden state that is passed to the next time step.

This hidden state is where the “memory” lives. It’s a vector that encapsulates information gathered from all previous elements in the sequence.

To better visualize this, let’s “unroll” the RNN over time. This shows how the network processes each element of the sequence sequentially, passing its internal state forward.

flowchart LR
    X0[Input X0] --> RNN0
    H0(Hidden State H0) --> RNN0
    RNN0 -->|Output Y0| Y0[Output Y0]
    RNN0 -->|Hidden State H1| H1(Hidden State H1)
    X1[Input X1] --> RNN1
    H1 --> RNN1
    RNN1 -->|Output Y1| Y1[Output Y1]
    RNN1 -->|Hidden State H2| H2(Hidden State H2)
    X2[Input X2] --> RNN2
    H2 --> RNN2
    RNN2 -->|Output Y2| Y2[Output Y2]
    RNN2 -->|Hidden State H3| H3(Hidden State H3)

In this diagram:

  • X0, X1, X2 are the input elements at different time steps.
  • H0, H1, H2, H3 are the hidden states. H0 is usually initialized to zeros.
  • Y0, Y1, Y2 are the outputs at different time steps.
  • Notice that the RNN block (representing the recurrent layer) is the same block at each time step. This means the same set of weights is used across the entire sequence, which is a powerful form of parameter sharing.

The calculation for the hidden state h_t and output y_t at time t typically looks something like this:

h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

Where:

  • x_t is the input at time t.
  • h_t is the hidden state at time t.
  • W_hh, W_xh, W_hy are weight matrices.
  • b_h, b_y are bias vectors.
  • tanh is a common activation function in recurrent layers; its bounded output range of (-1, 1) keeps the hidden state from growing without limit, which is one reason it is often preferred over ReLU here.
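To make the recurrence concrete, here is a single RNN step written out in PyTorch, with small made-up sizes and randomly initialized weights (a sketch of the equations above, not a trained model):

```python
import torch

torch.manual_seed(0)
input_size, hidden_size, output_size = 4, 3, 2

# Randomly initialized parameters, just to illustrate the shapes.
# A real RNN learns these during training.
W_xh = torch.randn(hidden_size, input_size)
W_hh = torch.randn(hidden_size, hidden_size)
W_hy = torch.randn(output_size, hidden_size)
b_h = torch.zeros(hidden_size)
b_y = torch.zeros(output_size)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t + b_h)
    h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    # y_t = W_hy @ h_t + b_y
    y_t = W_hy @ h_t + b_y
    return h_t, y_t

h = torch.zeros(hidden_size)  # H0 initialized to zeros
for t in range(3):            # process a 3-step sequence
    x_t = torch.randn(input_size)
    h, y = rnn_step(x_t, h)   # the same weights are reused at every step

print(h.shape, y.shape)  # torch.Size([3]) torch.Size([2])
```

Note that the same W matrices are applied at every time step; only the hidden state h changes as it carries information forward.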

The Vanishing/Exploding Gradient Problem

While basic RNNs are conceptually elegant, they suffer from a significant practical limitation, especially when dealing with long sequences: the vanishing or exploding gradient problem.

During backpropagation through time (BPTT), gradients are calculated by multiplying them back through the network’s layers at each time step.

  • Vanishing Gradients: If the weights in the recurrent connections are small, these gradients can shrink exponentially as they propagate backward through many time steps. This means that information from early parts of a long sequence effectively gets “forgotten” by the time it reaches later parts, making it hard for the network to learn long-term dependencies.
  • Exploding Gradients: Conversely, if the weights are large, the gradients can grow exponentially, leading to unstable training, numerical overflow, and wildly fluctuating updates.

This problem makes training plain RNNs on complex, long sequences very difficult. Thankfully, researchers developed more advanced architectures to mitigate this.
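A back-of-the-envelope sketch of the effect, using a scalar as a stand-in for the recurrent Jacobian that the gradient is multiplied by at each step of BPTT:

```python
# Repeatedly multiplying a gradient by a recurrent factor over 50 time steps.
# A factor < 1 shrinks it toward zero (vanishing); > 1 blows it up (exploding).
steps = 50

grad_small = 1.0
grad_large = 1.0
for _ in range(steps):
    grad_small *= 0.9   # stand-in for a "small weights" regime
    grad_large *= 1.1   # stand-in for a "large weights" regime

print(f"after {steps} steps: small ~ {grad_small:.6f}, large ~ {grad_large:.2f}")
# 0.9**50 is about 0.005, 1.1**50 is about 117 -- the signal from early
# time steps is either nearly gone or numerically unstable.
```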

Long Short-Term Memory (LSTM) Networks

LSTMs are a special kind of RNN designed specifically to overcome the vanishing gradient problem and learn long-term dependencies. They do this by introducing a sophisticated internal mechanism called a cell state and several gates that regulate the flow of information.

Think of the cell state as a conveyor belt that runs through the entire sequence. Information can be added to it or removed from it, but it flows relatively unchanged. The gates are neural networks that decide what information to keep, what to discard, and what to output.

The three main gates in an LSTM unit are:

  1. Forget Gate: Decides what information from the previous cell state C_{t-1} should be thrown away.
  2. Input Gate: Decides what new information from the current input x_t and previous hidden state h_{t-1} should be stored in the cell state C_t.
  3. Output Gate: Decides what parts of the cell state C_t should be output as the current hidden state h_t.

By carefully controlling these gates, LSTMs can selectively remember or forget information over very long sequences, making them incredibly powerful for tasks like machine translation, speech recognition, and complex time series prediction.
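The gate interactions can be sketched in PyTorch for a single time step. The weight matrices here are hypothetical random tensors, purely to show how the gates act on the cell state; a real nn.LSTM learns these parameters and packs them differently internally:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3
concat = input_size + hidden_size  # gates see [h_{t-1}, x_t] concatenated

# Hypothetical random parameters for the four gate computations
W_f, W_i, W_g, W_o = (torch.randn(hidden_size, concat) for _ in range(4))
b_f, b_i, b_g, b_o = (torch.zeros(hidden_size) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = torch.cat([h_prev, x_t])        # combined [h_{t-1}, x_t]
    f = torch.sigmoid(W_f @ z + b_f)    # forget gate: what to drop from C_{t-1}
    i = torch.sigmoid(W_i @ z + b_i)    # input gate: how much new info to store
    g = torch.tanh(W_g @ z + b_g)       # candidate values for the cell state
    c_t = f * c_prev + i * g            # update the "conveyor belt"
    o = torch.sigmoid(W_o @ z + b_o)    # output gate: what to expose as h_t
    h_t = o * torch.tanh(c_t)
    return h_t, c_t

h, c = torch.zeros(hidden_size), torch.zeros(hidden_size)
h, c = lstm_step(torch.randn(input_size), h, c)
print(h.shape, c.shape)
```

The additive cell-state update (f * c_prev + i * g) is the key: gradients can flow through it across many steps without being repeatedly squashed, which is what mitigates vanishing gradients.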

Gated Recurrent Unit (GRU) Networks

GRUs are another popular variant of RNNs, similar to LSTMs but with a simpler architecture. They were introduced to provide a more computationally efficient alternative while still addressing the vanishing gradient problem.

Instead of three gates and a separate cell state, GRUs combine the forget and input gates into a single update gate and also feature a reset gate. They merge the hidden state and cell state into a single “hidden state.”

  • Update Gate: Controls how much of the previous hidden state should be carried over to the current hidden state.
  • Reset Gate: Controls how much of the previous hidden state should be “forgotten” when computing the new candidate hidden state.

GRUs typically have fewer parameters than LSTMs, which can mean faster training and good generalization from less data. For many tasks their performance is comparable to LSTMs, making them an excellent choice when computational resources are limited or a simpler model is preferred.
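In PyTorch, trying a GRU is essentially a one-line swap: nn.GRU has the same call signature as nn.RNN, with a single hidden state and no cell state. A minimal shape check:

```python
import torch
import torch.nn as nn

gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)

x = torch.randn(1, 5, 4)   # (batch_size, sequence_length, input_size)
output, h_n = gru(x)       # single hidden state -- no cell state, unlike nn.LSTM

print(output.shape)  # torch.Size([1, 5, 8]) -- hidden state at every step
print(h_n.shape)     # torch.Size([1, 1, 8]) -- final hidden state
```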

Step-by-Step Implementation with PyTorch

Let’s get our hands dirty and implement a simple RNN and then an LSTM using PyTorch. We’ll use a very basic example: predicting the next character in a sequence.

First, ensure you have PyTorch installed. PyTorch 2.x is the current stable release series, offering significant performance improvements over 1.x.

# Verify Python version (e.g., 3.10 or newer)
python --version

# Install PyTorch (example for CPU, adjust for GPU if needed)
# For the most up-to-date instructions, always check the official PyTorch website.
# Example for PyTorch 2.x (CPU only)
pip install "torch==2.*" "torchvision==0.*" "torchaudio==0.*" --index-url https://download.pytorch.org/whl/cpu

The quotes prevent your shell from expanding the * wildcards. You can also pin exact versions (for example, torch==2.3.0 and torchvision==0.18.0); check the official PyTorch installation guide for the latest stable releases.

1. Preparing Sequence Data (Character-Level)

We’ll create a simple dataset to train our RNN. Let’s try to predict the next character in the word “hello”.

import torch
import torch.nn as nn
import torch.optim as optim

# Our simple sequence
word = "hello"
chars = sorted(list(set(word))) # Get unique characters and sort them
char_to_idx = {char: i for i, char in enumerate(chars)}
idx_to_char = {i: char for i, char in enumerate(chars)}

# Vocabulary size
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Char to Index: {char_to_idx}")
print(f"Index to Char: {idx_to_char}")
print(f"Vocabulary Size: {vocab_size}")

This snippet defines our vocabulary and creates mappings between characters and their integer indices. This is a fundamental step for any text-based task in deep learning.

Next, we’ll prepare our input-output pairs. For “hello”:

  • Input: “h”, Target: “e”
  • Input: “he”, Target: “l”
  • Input: “hel”, Target: “l”
  • Input: “hell”, Target: “o”

Each input character will be one-hot encoded.

def one_hot_encode(idx, vocab_size):
    # Creates a one-hot vector for a given index
    vec = torch.zeros(vocab_size)
    vec[idx] = 1
    return vec

# Create training data
X = [] # Inputs
y = [] # Targets

for i in range(len(word) - 1):
    input_char_idx = char_to_idx[word[i]]
    target_char_idx = char_to_idx[word[i+1]]

    # For a basic RNN, we'll feed one character at a time initially
    # For sequence prediction, input is one-hot of current char, target is index of next char
    X.append(one_hot_encode(input_char_idx, vocab_size))
    y.append(target_char_idx)

# Stack them into tensors
X_tensor = torch.stack(X).unsqueeze(0) # Add batch dimension (batch_size=1)
y_tensor = torch.tensor(y)

print(f"\nInput (X_tensor) shape: {X_tensor.shape}") # Should be (1, sequence_length, vocab_size)
print(f"Target (y_tensor) shape: {y_tensor.shape}") # Should be (sequence_length)

Explanation:

  • one_hot_encode: This function converts an integer index into a vector where only the position corresponding to the index is 1, and others are 0. This is how categorical data is usually fed into neural networks.
  • X and y lists: We iterate through the word to create pairs. X will store the one-hot encoded input character, and y will store the integer index of the target (next) character.
  • unsqueeze(0): PyTorch RNN layers expect input in the format (batch_size, sequence_length, input_size) when batch_first=True. Since we have a single sequence, we add a batch dimension of 1.

2. Building a Simple RNN Model

Now, let’s define our basic RNN. We’ll use torch.nn.RNN.

class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        
        # The RNN layer itself
        # batch_first=True means input/output tensors are (batch, seq, feature)
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        
        # A linear layer to map the RNN's output to our desired output size (vocab_size)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input_seq):
        # input_seq shape: (batch_size, sequence_length, input_size)
        
        # Initialize hidden state with zeros
        # hidden_state shape: (num_layers * num_directions, batch_size, hidden_size)
        # For a simple RNN, num_layers=1, num_directions=1
        h0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        
        # Pass input sequence and initial hidden state through the RNN layer
        # output shape: (batch_size, sequence_length, hidden_size)
        # h_n shape: (num_layers * num_directions, batch_size, hidden_size) - final hidden state
        output, h_n = self.rnn(input_seq, h0)
        
        # We want to predict for each step, so we take the output from all time steps
        # Reshape output to (batch_size * sequence_length, hidden_size) for the linear layer
        output = output.reshape(-1, self.hidden_size)
        
        # Pass through the final linear layer
        output = self.fc(output)
        
        return output

# Model parameters
input_size = vocab_size    # Each character is a one-hot vector of vocab_size
hidden_size = 16           # Number of features in the hidden state
output_size = vocab_size   # We want to predict one of the vocab_size characters

# Instantiate the model
model = SimpleRNN(input_size, hidden_size, output_size)
print(f"\nSimple RNN Model:\n{model}")

Explanation:

  • __init__: We define the nn.RNN layer. input_size is the dimension of each input element (our one-hot vector). hidden_size is the dimension of the hidden state. batch_first=True is a common and convenient setting for PyTorch RNNs, meaning the batch dimension comes first. We also add a nn.Linear layer to map the RNN’s hidden state outputs to our desired output dimension (the vocabulary size).
  • forward:
    • h0: We initialize the hidden state to zeros. The shape for h0 is (num_layers * num_directions, batch_size, hidden_size). Since we have one layer and it’s unidirectional, it’s (1, batch_size, hidden_size).
    • self.rnn(input_seq, h0): This is the core call. It returns two things: output (the hidden state at each time step, after passing through the non-linearity) and h_n (the final hidden state of the last layer for the last time step).
    • output.reshape(-1, self.hidden_size): We want to predict a character at each step. The output tensor from rnn has shape (batch_size, sequence_length, hidden_size). We flatten the batch_size and sequence_length dimensions so that each hidden state from each time step can be passed independently through the final fc layer to produce a prediction.

3. Training the Simple RNN

Let’s train our SimpleRNN to predict the next character.

# Loss function and optimizer
criterion = nn.CrossEntropyLoss() # Suitable for multi-class classification
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
print("\nTraining Simple RNN...")
for epoch in range(num_epochs):
    model.train() # Set model to training mode
    optimizer.zero_grad() # Clear gradients

    outputs = model(X_tensor) # Forward pass
    loss = criterion(outputs, y_tensor) # Calculate loss

    loss.backward() # Backward pass
    optimizer.step() # Update weights

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("Training finished.")

# Let's see how well it learned
model.eval() # Set model to evaluation mode
with torch.no_grad():
    test_input = one_hot_encode(char_to_idx['h'], vocab_size).unsqueeze(0).unsqueeze(0) # (1, 1, vocab_size)
    
    # Simulate sequence generation for "hello"
    generated_word = "h"
    hidden = torch.zeros(1, 1, hidden_size) # Initial hidden state

    # To generate, we feed the model's own predicted character back in as the
    # next input at each step (greedy decoding via argmax below).
    for _ in range(len(word) - 1):
        output, hidden = model.rnn(test_input, hidden)
        prediction = model.fc(output.squeeze(0)) # Remove batch and sequence dim
        
        predicted_idx = torch.argmax(prediction).item()
        predicted_char = idx_to_char[predicted_idx]
        generated_word += predicted_char
        
        # Prepare next input: one-hot of the predicted character
        test_input = one_hot_encode(predicted_idx, vocab_size).unsqueeze(0).unsqueeze(0)

print(f"Generated word (starting with 'h'): {generated_word}")

Explanation:

  • Training Loop: Standard PyTorch training loop. We use nn.CrossEntropyLoss because our task is effectively a multi-class classification problem at each time step (predicting which of the vocab_size characters is next). optim.Adam is a robust choice for optimization.
  • Prediction/Generation: After training, we evaluate the model. We start with ‘h’, feed it to the RNN, get a prediction, then feed that predicted character back as the next input, and so on. This demonstrates the sequence generation capability. Note that for a single character input, we need to add both batch and sequence length dimensions.

4. Building an LSTM Model

Now, let’s upgrade our SimpleRNN to use an LSTM. The structure will be very similar, but we’ll use torch.nn.LSTM instead.

class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleLSTM, self).__init__()
        self.hidden_size = hidden_size
        
        # The LSTM layer itself
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        
        # A linear layer to map the LSTM's output to our desired output size
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input_seq):
        # input_seq shape: (batch_size, sequence_length, input_size)
        
        # Initialize hidden state (h0) and cell state (c0) with zeros
        # Both shapes: (num_layers * num_directions, batch_size, hidden_size)
        h0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        c0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        
        # Pass input sequence and initial states through the LSTM layer
        # output shape: (batch_size, sequence_length, hidden_size)
        # (h_n, c_n) are the final hidden and cell states
        output, (h_n, c_n) = self.lstm(input_seq, (h0, c0))
        
        # Reshape output for the linear layer
        output = output.reshape(-1, self.hidden_size)
        
        # Pass through the final linear layer
        output = self.fc(output)
        
        return output

# Instantiate the LSTM model
lstm_model = SimpleLSTM(input_size, hidden_size, output_size)
print(f"\nSimple LSTM Model:\n{lstm_model}")

# Loss function and optimizer for LSTM
lstm_criterion = nn.CrossEntropyLoss()
lstm_optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

# Training loop for LSTM
print("\nTraining Simple LSTM...")
for epoch in range(num_epochs):
    lstm_model.train()
    lstm_optimizer.zero_grad()

    outputs = lstm_model(X_tensor)
    loss = lstm_criterion(outputs, y_tensor)

    loss.backward()
    lstm_optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

print("LSTM Training finished.")

# LSTM generation
lstm_model.eval()
with torch.no_grad():
    test_input = one_hot_encode(char_to_idx['h'], vocab_size).unsqueeze(0).unsqueeze(0)
    
    generated_word_lstm = "h"
    hidden_lstm = (torch.zeros(1, 1, hidden_size), torch.zeros(1, 1, hidden_size)) # Initial hidden and cell states

    for _ in range(len(word) - 1):
        output, hidden_lstm = lstm_model.lstm(test_input, hidden_lstm)
        prediction = lstm_model.fc(output.squeeze(0))
        
        predicted_idx = torch.argmax(prediction).item()
        predicted_char = idx_to_char[predicted_idx]
        generated_word_lstm += predicted_char
        
        test_input = one_hot_encode(predicted_idx, vocab_size).unsqueeze(0).unsqueeze(0)

print(f"Generated word (LSTM, starting with 'h'): {generated_word_lstm}")

Explanation:

  • The SimpleLSTM class is almost identical to SimpleRNN, but it uses nn.LSTM.
  • Crucially, the forward method now initializes both a hidden state h0 and a cell state c0.
  • The self.lstm call now expects a tuple (h0, c0) for its initial states and returns a tuple (h_n, c_n) for its final states.
  • The rest of the training and generation logic remains the same. You’ll likely notice the LSTM performs similarly or slightly better on this tiny example, but its true power shines on much longer and more complex sequences.

Mini-Challenge: Predict a Simple Numeric Sequence

Your turn! Let’s apply what you’ve learned to a numeric sequence.

Challenge: Create a SimpleLSTM model (or adapt the one above) to predict the next number in the sequence [10, 20, 30, 40, 50].

  • Input: A single number (e.g., 10), one-hot encoded if you want to treat it as a categorical feature, or simply normalized if you treat it as a continuous value. For simplicity, let’s treat numbers as continuous values in this challenge, and the output will be a single regression value.
  • Target: The next number in the sequence.
  • Data Preparation: Create input-target pairs like (10, 20), (20, 30), (30, 40), (40, 50).
  • Model Adjustment:
    • input_size will be 1 (for a single continuous number).
    • output_size will be 1 (for a single predicted continuous number).
    • Change the loss function to nn.MSELoss (Mean Squared Error) for regression.
  • Training: Train the model and then try to predict the next few numbers starting from 50.

Hint:

  • Normalize your input numbers (e.g., divide by the max value, or use MinMaxScaler if you prefer). Remember to de-normalize your output for interpretation.
  • For the nn.Linear layer, make sure its in_features is hidden_size and out_features is 1.
  • When creating tensors, ensure they have the correct float type (e.g., torch.tensor(data, dtype=torch.float32)).
  • The input to the LSTM will be (batch_size, sequence_length, input_size). For a single number, this will be (1, 1, 1).
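To get started, here is one possible data-preparation sketch, normalizing by the sequence maximum (one reasonable choice among several):

```python
import torch

sequence = [10, 20, 30, 40, 50]
max_val = max(sequence)  # simple normalization constant

# Input-target pairs: (10, 20), (20, 30), (30, 40), (40, 50)
inputs  = torch.tensor(sequence[:-1], dtype=torch.float32) / max_val
targets = torch.tensor(sequence[1:],  dtype=torch.float32) / max_val

# The LSTM expects (batch_size, sequence_length, input_size);
# here we treat each number as one step of a single sequence.
X = inputs.view(1, -1, 1)   # shape (1, 4, 1)
y = targets.view(-1, 1)     # shape (4, 1), matching a 1-unit output layer

print(X.shape, y.shape)

# After training, remember to de-normalize predictions: prediction * max_val
```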

What to observe/learn:

  • How well an LSTM can learn simple numerical patterns.
  • The differences in data preparation and loss function when moving from classification (characters) to regression (numbers).
  • The flexibility of RNNs/LSTMs to handle different types of sequential data.

Common Pitfalls & Troubleshooting

  1. Incorrect Tensor Shapes: This is the most common issue when working with RNNs.
    • Symptom: RuntimeError: expected input to be 3D, got 2D or size mismatch.
    • Fix: Always double-check the expected input shape for your RNN layer: (batch_size, sequence_length, input_size) when batch_first=True. Use unsqueeze() or view() to add/remove dimensions as needed. Remember h0 and c0 also have specific shapes.
  2. Not Detaching Hidden States (Manual RNNs): If you’re manually managing hidden states across separate training iterations (e.g., for very long sequences where you process chunks), you might forget to detach() the hidden state from the computation graph.
    • Symptom: Gradients accumulating indefinitely, leading to memory errors or incorrect updates.
    • Fix: After each forward/backward pass and before the next one, call hidden_state.detach() on any state you intend to carry over (for LSTMs, detach both the hidden and cell states). PyTorch’s nn.RNN and nn.LSTM will not do this for you: a state tensor you pass back in stays connected to the previous iteration’s computation graph until you detach it.
  3. Vanishing/Exploding Gradients (Even with LSTMs/GRUs): While LSTMs and GRUs are designed to mitigate these, they can still occur, especially with very deep networks or poorly chosen learning rates.
    • Symptom: Loss becoming NaN (Not a Number) or not decreasing, or extremely slow training.
    • Fix:
      • Gradient Clipping: A common technique is to clip gradients to a maximum value. torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=X) can be called after loss.backward() and before optimizer.step().
      • Learning Rate Adjustment: A smaller learning rate can help.
      • Initialization: Proper weight initialization can also play a role.
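For reference, gradient clipping slots in between loss.backward() and optimizer.step(). A minimal sketch with a stand-in model and a dummy loss, just to show where the call goes:

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Tiny stand-in model, purely to demonstrate where clipping fits in the loop
model = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)
optimizer = optim.Adam(model.parameters(), lr=0.01)

x = torch.randn(1, 10, 3)
output, _ = model(x)
loss = output.sum()          # dummy loss, for illustration only

optimizer.zero_grad()
loss.backward()

# Clip the global gradient norm to 1.0 before the parameter update.
# Returns the total norm *before* clipping, which is handy for logging.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```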

Summary

Phew! You’ve just unlocked a powerful new class of neural networks! Here’s a quick recap of what we covered:

  • The Need for RNNs: Traditional feedforward networks struggle with sequential data due to their lack of memory and fixed input size.
  • Basic RNNs: Introduce a “recurrent” connection that allows information (via a hidden state) to flow from one time step to the next, enabling memory and parameter sharing.
  • Vanishing/Exploding Gradients: A major limitation of basic RNNs, making it difficult to learn long-term dependencies.
  • LSTMs (Long Short-Term Memory): A sophisticated RNN variant with a cell state and three gates (forget, input, output) that effectively solve the vanishing gradient problem, allowing them to learn very long-term dependencies.
  • GRUs (Gated Recurrent Units): A simpler, more computationally efficient alternative to LSTMs, using an update gate and a reset gate. Often performs comparably to LSTMs.
  • PyTorch Implementation: We walked through how to prepare sequence data, build nn.RNN and nn.LSTM models, and train them for character-level prediction.

RNNs, LSTMs, and GRUs have been foundational for breakthroughs in areas like natural language processing and time series analysis. While newer architectures like Transformers (which we’ll explore later) have gained prominence, understanding recurrent networks is crucial for any AI/ML engineer, providing a solid foundation for processing sequential information.

In the next chapter, we’ll expand our deep learning toolkit even further by delving into Convolutional Neural Networks (CNNs), which are the cornerstone of modern computer vision!

References

  1. PyTorch Documentation - torch.nn.RNN: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
  2. PyTorch Documentation - torch.nn.LSTM: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
  3. PyTorch Documentation - torch.nn.GRU: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
  4. Colah’s Blog - Understanding LSTMs (Highly Recommended Read): https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  5. PyTorch Installation Guide: https://pytorch.org/get-started/locally/
