Welcome back, future AI engineer! In our previous chapters, we mastered the fundamentals of deep learning with feedforward neural networks (FNNs). We learned how these networks excel at tasks where inputs are independent and fixed in size, like classifying images or predicting a single value from a structured dataset.
But what happens when the order of your data matters? What if your input isn’t a single, fixed-size vector, but a sequence of varying length, where each element’s meaning is influenced by what came before it? Think about natural language, where the meaning of a word depends on the preceding words, or time series data, where future values are influenced by past observations. Traditional FNNs hit a wall here because they lack “memory” and treat each input independently.
This chapter introduces you to the fascinating world of Recurrent Neural Networks (RNNs). You’ll discover how RNNs are designed to process sequential data by maintaining an internal “state” or “memory” that captures information from previous steps in the sequence. We’ll start with the basic RNN architecture, understand its limitations like vanishing gradients, and then explore more sophisticated variants like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs) that overcome these challenges. By the end, you’ll be able to implement these powerful models using PyTorch, preparing you for real-world tasks in areas like natural language processing, speech recognition, and time series forecasting.
Ready to add the power of memory to your neural networks? Let’s dive in!
The Challenge of Sequence Data
Imagine you’re trying to predict the next word in a sentence: “The cat sat on the ___.” To fill in that blank, you need to remember “cat” and “sat” to infer that “mat” or “rug” are likely candidates. A standard FNN, which takes a fixed-size input and produces a fixed-size output, struggles with this. It doesn’t inherently understand the concept of order or carry information from one input to the next.
Here’s why FNNs fall short with sequences:
- Fixed Input Size: FNNs require a predefined number of input features. How would you represent a sentence that could be 3 words long or 30 words long? Padding shorter sequences or truncating longer ones can lose valuable information.
- No Memory: Each input to an FNN is processed independently. There’s no mechanism for the network to remember previous inputs in a sequence, which is crucial for understanding context.
- Lack of Parameter Sharing: If you tried to process each word in a sentence with a separate FNN, you’d end up with a huge number of parameters, and the network wouldn’t generalize well across different positions in a sequence.
Introducing Recurrent Neural Networks (RNNs)
RNNs solve these problems by introducing a “loop” or “recurrence” in their architecture. This loop allows information to persist from one step of the sequence to the next. Think of it as the network having a short-term memory.
How a Basic RNN Works
At its core, an RNN processes a sequence element by element, and at each step, it takes two inputs:
- The current element from the input sequence.
- The “hidden state” from the previous time step.
It then produces two outputs:
- An output for the current time step (e.g., a prediction).
- An updated hidden state that is passed to the next time step.
This hidden state is where the “memory” lives. It’s a vector that encapsulates information gathered from all previous elements in the sequence.
To better visualize this, let’s “unroll” the RNN over time. This shows how the network processes each element of the sequence sequentially, passing its internal state forward.
In this unrolled view:
- `X0, X1, X2` are the input elements at different time steps.
- `H0, H1, H2, H3` are the hidden states. `H0` is usually initialized to zeros.
- `Y0, Y1, Y2` are the outputs at different time steps.
- Notice that the `RNN` block (representing the recurrent layer) is the same block at each time step. This means the same set of weights is used across the entire sequence, which is a powerful form of parameter sharing.
The calculation for the hidden state h_t and output y_t at time t typically looks something like this:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y
Where:
- `x_t` is the input at time `t`.
- `h_t` is the hidden state at time `t`.
- `W_hh`, `W_xh`, `W_hy` are weight matrices.
- `b_h`, `b_y` are bias vectors.
- `tanh` is a common activation function; it is often preferred over ReLU in RNNs because its output is bounded to (-1, 1), which helps keep the hidden state from growing without bound.
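To make these two equations concrete, here is a minimal single-step sketch in PyTorch. The dimensions (`input_size=3`, `hidden_size=4`) and the random weights are purely illustrative; in practice `nn.RNN` creates and manages these parameters for you.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes only

W_xh = torch.randn(hidden_size, input_size)   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size)  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)                # hidden bias

x_t = torch.randn(input_size)      # current input element
h_prev = torch.zeros(hidden_size)  # previous hidden state (h_0 = zeros)

# h_t = tanh(W_hh * h_{t-1} + W_xh * x_t + b_h)
h_t = torch.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
print(h_t.shape)  # torch.Size([4])
```

Each call to this step consumes one sequence element and one hidden state, and produces the next hidden state; repeating it over a sequence is exactly what the unrolled diagram depicts.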
The Vanishing/Exploding Gradient Problem
While basic RNNs are conceptually elegant, they suffer from a significant practical limitation, especially when dealing with long sequences: the vanishing or exploding gradient problem.
During backpropagation through time (BPTT), gradients are calculated by multiplying them back through the network’s layers at each time step.
- Vanishing Gradients: If the weights in the recurrent connections are small, these gradients can shrink exponentially as they propagate backward through many time steps. This means that information from early parts of a long sequence effectively gets “forgotten” by the time it reaches later parts, making it hard for the network to learn long-term dependencies.
- Exploding Gradients: Conversely, if the weights are large, the gradients can grow exponentially, leading to unstable training, numerical overflow, and wildly fluctuating updates.
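A toy calculation makes the mechanism visible: BPTT effectively multiplies the gradient by the recurrent weight matrix once per time step, so the gradient's size changes exponentially with sequence length. The matrices below are illustrative stand-ins (scaled identities), not trained weights.

```python
import torch

grad = torch.ones(4)  # a stand-in gradient vector

W_small = 0.5 * torch.eye(4)  # "small" recurrent weights -> vanishing
W_large = 1.5 * torch.eye(4)  # "large" recurrent weights -> exploding

g_small, g_large = grad.clone(), grad.clone()
for _ in range(20):  # backpropagate through 20 time steps
    g_small = W_small @ g_small
    g_large = W_large @ g_large

print(g_small.norm().item())  # shrinks by 0.5^20 -> effectively zero
print(g_large.norm().item())  # grows by 1.5^20 -> in the thousands
```

Real recurrent weights are not scaled identities, but the same exponential behavior governs them through their largest singular values.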
This problem makes training plain RNNs on complex, long sequences very difficult. Thankfully, researchers developed more advanced architectures to mitigate this.
Long Short-Term Memory (LSTM) Networks
LSTMs are a special kind of RNN designed specifically to overcome the vanishing gradient problem and learn long-term dependencies. They do this by introducing a sophisticated internal mechanism called a cell state and several gates that regulate the flow of information.
Think of the cell state as a conveyor belt that runs through the entire sequence. Information can be added to it or removed from it, but it flows relatively unchanged. The gates are neural networks that decide what information to keep, what to discard, and what to output.
The three main gates in an LSTM unit are:
- Forget Gate: Decides what information from the previous cell state `C_{t-1}` should be thrown away.
- Input Gate: Decides what new information from the current input `x_t` and previous hidden state `h_{t-1}` should be stored in the cell state `C_t`.
- Output Gate: Decides what parts of the cell state `C_t` should be output as the current hidden state `h_t`.
By carefully controlling these gates, LSTMs can selectively remember or forget information over very long sequences, making them incredibly powerful for tasks like machine translation, speech recognition, and complex time series prediction.
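To make the gate descriptions concrete, here is a hand-rolled single LSTM step following the standard LSTM formulation. The dimensions and random weights are purely illustrative; `nn.LSTM` implements all of this (fused and optimized) for you.

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 3, 4  # illustrative sizes only

def gate_params():
    # One weight matrix for the input, one for the hidden state, one bias
    return (torch.randn(hidden_size, input_size),
            torch.randn(hidden_size, hidden_size),
            torch.zeros(hidden_size))

W_xf, W_hf, b_f = gate_params()  # forget gate
W_xi, W_hi, b_i = gate_params()  # input gate
W_xo, W_ho, b_o = gate_params()  # output gate
W_xc, W_hc, b_c = gate_params()  # candidate cell state

x_t = torch.randn(input_size)
h_prev = torch.zeros(hidden_size)
c_prev = torch.zeros(hidden_size)

f_t = torch.sigmoid(W_xf @ x_t + W_hf @ h_prev + b_f)   # what to forget
i_t = torch.sigmoid(W_xi @ x_t + W_hi @ h_prev + b_i)   # what to write
o_t = torch.sigmoid(W_xo @ x_t + W_ho @ h_prev + b_o)   # what to expose
c_tilde = torch.tanh(W_xc @ x_t + W_hc @ h_prev + b_c)  # candidate values

c_t = f_t * c_prev + i_t * c_tilde  # update the cell-state "conveyor belt"
h_t = o_t * torch.tanh(c_t)         # new hidden state
print(h_t.shape, c_t.shape)
```

Note how the cell state update is purely additive gating (`f_t * c_prev + i_t * c_tilde`) rather than a repeated matrix multiplication; this is what lets gradients flow over long spans without vanishing.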
Gated Recurrent Unit (GRU) Networks
GRUs are another popular variant of RNNs, similar to LSTMs but with a simpler architecture. They were introduced to provide a more computationally efficient alternative while still addressing the vanishing gradient problem.
Instead of three gates and a separate cell state, GRUs combine the forget and input gates into a single update gate and also feature a reset gate. They merge the hidden state and cell state into a single “hidden state.”
- Update Gate: Controls how much of the previous hidden state should be carried over to the current hidden state.
- Reset Gate: Controls how much of the previous hidden state should be “forgotten” when computing the new candidate hidden state.
GRUs typically have fewer parameters than LSTMs, which can lead to faster training and less data required to generalize well. For many tasks, their performance is comparable to LSTMs, making them an excellent choice when computational resources are a concern or when simpler models are preferred.
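In PyTorch, swapping in a GRU is essentially a one-line change from `nn.RNN` or `nn.LSTM`. A quick sketch of `nn.GRU`'s interface (with illustrative sizes): unlike the LSTM it takes and returns only a hidden state, with no separate cell state.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=4, hidden_size=16, batch_first=True)

x = torch.randn(1, 5, 4)    # (batch_size, sequence_length, input_size)
h0 = torch.zeros(1, 1, 16)  # (num_layers * num_directions, batch_size, hidden_size)
output, h_n = gru(x, h0)    # no (h0, c0) tuple, just h0

print(output.shape)  # torch.Size([1, 5, 16]) - hidden state at every time step
print(h_n.shape)     # torch.Size([1, 1, 16]) - final hidden state only
```

Replacing `self.rnn`/`self.lstm` with a `nn.GRU` in the models built later in this chapter would work the same way.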
Step-by-Step Implementation with PyTorch
Let’s get our hands dirty and implement a simple RNN and then an LSTM using PyTorch. We’ll use a very basic example: predicting the next character in a sequence.
First, ensure you have PyTorch installed. As of January 2026, PyTorch 2.x is the stable release, offering significant performance improvements.
# Verify Python version (e.g., 3.10 or newer)
python --version
# Install PyTorch (example for CPU, adjust for GPU if needed)
# For the most up-to-date instructions, always check the official PyTorch website.
# Example for PyTorch 2.x (CPU only)
pip install torch==2.* torchvision==0.* torchaudio==0.* --index-url https://download.pytorch.org/whl/cpu
The wildcards `2.*` and `0.*` resolve to the latest matching stable releases; if you need a reproducible environment, pin exact versions instead, for example `2.3.0` and `0.18.0`.
1. Preparing Sequence Data (Character-Level)
We’ll create a simple dataset to train our RNN. Let’s try to predict the next character in the word “hello”.
import torch
import torch.nn as nn
import torch.optim as optim
# Our simple sequence
word = "hello"
chars = sorted(list(set(word))) # Get unique characters and sort them
char_to_idx = {char: i for i, char in enumerate(chars)}
idx_to_char = {i: char for i, char in enumerate(chars)}
# Vocabulary size
vocab_size = len(chars)
print(f"Vocabulary: {chars}")
print(f"Char to Index: {char_to_idx}")
print(f"Index to Char: {idx_to_char}")
print(f"Vocabulary Size: {vocab_size}")
This snippet defines our vocabulary and creates mappings between characters and their integer indices. This is a fundamental step for any text-based task in deep learning.
Next, we’ll prepare our input-output pairs. For “hello”:
- Input: “h”, Target: “e”
- Input: “he”, Target: “l”
- Input: “hel”, Target: “l”
- Input: “hell”, Target: “o”
Each input character will be one-hot encoded.
def one_hot_encode(idx, vocab_size):
    # Creates a one-hot vector for a given index
    vec = torch.zeros(vocab_size)
    vec[idx] = 1
    return vec
# Create training data
X = [] # Inputs
y = [] # Targets
for i in range(len(word) - 1):
    input_char_idx = char_to_idx[word[i]]
    target_char_idx = char_to_idx[word[i + 1]]
    # For a basic RNN we feed one character at a time:
    # input is the one-hot of the current char, target is the index of the next char
    X.append(one_hot_encode(input_char_idx, vocab_size))
    y.append(target_char_idx)
# Stack them into tensors
X_tensor = torch.stack(X).unsqueeze(0) # Add batch dimension (batch_size=1)
y_tensor = torch.tensor(y)
print(f"\nInput (X_tensor) shape: {X_tensor.shape}") # Should be (1, sequence_length, vocab_size)
print(f"Target (y_tensor) shape: {y_tensor.shape}") # Should be (sequence_length)
Explanation:
- `one_hot_encode`: This function converts an integer index into a vector where only the position corresponding to the index is 1 and all others are 0. This is how categorical data is usually fed into neural networks.
- `X` and `y` lists: We iterate through the word to create pairs. `X` stores the one-hot encoded input characters, and `y` stores the integer indices of the target (next) characters.
- `unsqueeze(0)`: PyTorch RNN layers expect input in the format `(batch_size, sequence_length, input_size)` when `batch_first=True`. Since we have a single sequence, we add a batch dimension of 1.
2. Building a Simple RNN Model
Now, let’s define our basic RNN. We’ll use torch.nn.RNN.
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.hidden_size = hidden_size
        # The RNN layer itself
        # batch_first=True means input/output tensors are (batch, seq, feature)
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        # A linear layer to map the RNN's output to our desired output size (vocab_size)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input_seq):
        # input_seq shape: (batch_size, sequence_length, input_size)
        # Initialize hidden state with zeros
        # hidden state shape: (num_layers * num_directions, batch_size, hidden_size)
        # For a simple RNN, num_layers=1, num_directions=1
        h0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        # Pass input sequence and initial hidden state through the RNN layer
        # output shape: (batch_size, sequence_length, hidden_size)
        # h_n shape: (num_layers * num_directions, batch_size, hidden_size) - final hidden state
        output, h_n = self.rnn(input_seq, h0)
        # We want a prediction at each step, so we take the output from all time steps
        # Reshape output to (batch_size * sequence_length, hidden_size) for the linear layer
        output = output.reshape(-1, self.hidden_size)
        # Pass through the final linear layer
        output = self.fc(output)
        return output

# Model parameters
input_size = vocab_size   # Each character is a one-hot vector of vocab_size
hidden_size = 16          # Number of features in the hidden state
output_size = vocab_size  # We want to predict one of the vocab_size characters

# Instantiate the model
model = SimpleRNN(input_size, hidden_size, output_size)
print(f"\nSimple RNN Model:\n{model}")
Explanation:
- `__init__`: We define the `nn.RNN` layer. `input_size` is the dimension of each input element (our one-hot vector), and `hidden_size` is the dimension of the hidden state. `batch_first=True` is a common and convenient setting for PyTorch RNNs, meaning the batch dimension comes first. We also add an `nn.Linear` layer to map the RNN's hidden state outputs to our desired output dimension (the vocabulary size).
- `forward`:
  - `h0`: We initialize the hidden state to zeros. The shape of `h0` is `(num_layers * num_directions, batch_size, hidden_size)`. Since we have one unidirectional layer, this is `(1, batch_size, hidden_size)`.
  - `self.rnn(input_seq, h0)`: This is the core call. It returns two things: `output` (the hidden state at each time step, after passing through the non-linearity) and `h_n` (the final hidden state of the last layer).
  - `output.reshape(-1, self.hidden_size)`: We want to predict a character at each step. The `output` tensor from `rnn` has shape `(batch_size, sequence_length, hidden_size)`. We flatten the `batch_size` and `sequence_length` dimensions so that each time step's hidden state can be passed independently through the final `fc` layer to produce a prediction.
3. Training the Simple RNN
Let’s train our SimpleRNN to predict the next character.
# Loss function and optimizer
criterion = nn.CrossEntropyLoss()  # Suitable for multi-class classification
optimizer = optim.Adam(model.parameters(), lr=0.01)

# Training loop
num_epochs = 100
print("\nTraining Simple RNN...")
for epoch in range(num_epochs):
    model.train()                        # Set model to training mode
    optimizer.zero_grad()                # Clear gradients
    outputs = model(X_tensor)            # Forward pass
    loss = criterion(outputs, y_tensor)  # Calculate loss
    loss.backward()                      # Backward pass
    optimizer.step()                     # Update weights
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
print("Training finished.")
# Let's see how well it learned
model.eval() # Set model to evaluation mode
with torch.no_grad():
test_input = one_hot_encode(char_to_idx['h'], vocab_size).unsqueeze(0).unsqueeze(0) # (1, 1, vocab_size)
# Simulate sequence generation for "hello"
generated_word = "h"
hidden = torch.zeros(1, 1, hidden_size) # Initial hidden state
# To predict the next character, we need to pass the current predicted character back as input
# This is a simplified generation loop just for demonstration
# Real generation would involve feeding the model its own predictions
for _ in range(len(word) - 1):
output, hidden = model.rnn(test_input, hidden)
prediction = model.fc(output.squeeze(0)) # Remove batch and sequence dim
predicted_idx = torch.argmax(prediction).item()
predicted_char = idx_to_char[predicted_idx]
generated_word += predicted_char
# Prepare next input: one-hot of the predicted character
test_input = one_hot_encode(predicted_idx, vocab_size).unsqueeze(0).unsqueeze(0)
print(f"Generated word (starting with 'h'): {generated_word}")
Explanation:
- Training Loop: A standard PyTorch training loop. We use `nn.CrossEntropyLoss` because our task is effectively a multi-class classification problem at each time step (predicting which of the `vocab_size` characters comes next). `optim.Adam` is a robust choice of optimizer.
- Prediction/Generation: After training, we evaluate the model. We start with 'h', feed it to the RNN, get a prediction, then feed the predicted character back as the next input, and so on. This demonstrates the sequence generation capability. Note that for a single character input, we must add both the batch and sequence-length dimensions.
4. Building an LSTM Model
Now, let’s upgrade our SimpleRNN to use an LSTM. The structure will be very similar, but we’ll use torch.nn.LSTM instead.
class SimpleLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleLSTM, self).__init__()
        self.hidden_size = hidden_size
        # The LSTM layer itself
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # A linear layer to map the LSTM's output to our desired output size
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, input_seq):
        # input_seq shape: (batch_size, sequence_length, input_size)
        # Initialize hidden state (h0) and cell state (c0) with zeros
        # Both shapes: (num_layers * num_directions, batch_size, hidden_size)
        h0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        c0 = torch.zeros(1, input_seq.size(0), self.hidden_size).to(input_seq.device)
        # Pass input sequence and initial states through the LSTM layer
        # output shape: (batch_size, sequence_length, hidden_size)
        # (h_n, c_n) are the final hidden and cell states
        output, (h_n, c_n) = self.lstm(input_seq, (h0, c0))
        # Reshape output for the linear layer
        output = output.reshape(-1, self.hidden_size)
        # Pass through the final linear layer
        output = self.fc(output)
        return output

# Instantiate the LSTM model
lstm_model = SimpleLSTM(input_size, hidden_size, output_size)
print(f"\nSimple LSTM Model:\n{lstm_model}")
# Loss function and optimizer for LSTM
lstm_criterion = nn.CrossEntropyLoss()
lstm_optimizer = optim.Adam(lstm_model.parameters(), lr=0.01)

# Training loop for LSTM
print("\nTraining Simple LSTM...")
for epoch in range(num_epochs):
    lstm_model.train()
    lstm_optimizer.zero_grad()
    outputs = lstm_model(X_tensor)
    loss = lstm_criterion(outputs, y_tensor)
    loss.backward()
    lstm_optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
print("LSTM Training finished.")
# LSTM generation
lstm_model.eval()
with torch.no_grad():
    test_input = one_hot_encode(char_to_idx['h'], vocab_size).unsqueeze(0).unsqueeze(0)
    generated_word_lstm = "h"
    # Initial hidden and cell states
    hidden_lstm = (torch.zeros(1, 1, hidden_size), torch.zeros(1, 1, hidden_size))
    for _ in range(len(word) - 1):
        output, hidden_lstm = lstm_model.lstm(test_input, hidden_lstm)
        prediction = lstm_model.fc(output.squeeze(0))
        predicted_idx = torch.argmax(prediction).item()
        predicted_char = idx_to_char[predicted_idx]
        generated_word_lstm += predicted_char
        test_input = one_hot_encode(predicted_idx, vocab_size).unsqueeze(0).unsqueeze(0)

print(f"Generated word (LSTM, starting with 'h'): {generated_word_lstm}")
Explanation:
- The `SimpleLSTM` class is almost identical to `SimpleRNN`, but it uses `nn.LSTM`.
- Crucially, the `forward` method now initializes both a hidden state `h0` and a cell state `c0`.
- The `self.lstm` call expects a tuple `(h0, c0)` for its initial states and returns a tuple `(h_n, c_n)` for its final states.
- The rest of the training and generation logic remains the same. You'll likely notice the LSTM performs similarly or slightly better on this tiny example, but its true power shows on much longer and more complex sequences.
Mini-Challenge: Predict a Simple Numeric Sequence
Your turn! Let’s apply what you’ve learned to a numeric sequence.
Challenge:
Create a SimpleLSTM model (or adapt the one above) to predict the next number in the sequence [10, 20, 30, 40, 50].
- Input: A single number (e.g., `10`). You could one-hot encode it to treat it as categorical, but for simplicity, treat numbers as continuous values in this challenge, so the output will be a single regression value.
- Target: The next number in the sequence.
- Data Preparation: Create input-target pairs like `(10, 20), (20, 30), (30, 40), (40, 50)`.
- Model Adjustment:
  - `input_size` will be 1 (a single continuous number).
  - `output_size` will be 1 (a single predicted continuous number).
  - Change the loss function to `nn.MSELoss` (Mean Squared Error) for regression.
- Training: Train the model, then try to predict the next few numbers starting from `50`.
Hint:
- Normalize your input numbers (e.g., divide by the maximum value, or use `MinMaxScaler` if you prefer). Remember to de-normalize the output for interpretation.
- For the `nn.Linear` layer, make sure its `in_features` is `hidden_size` and `out_features` is 1.
- When creating tensors, ensure they have the correct float type (e.g., `torch.tensor(data, dtype=torch.float32)`).
- The input to the LSTM will be `(batch_size, sequence_length, input_size)`. For a single number, this is `(1, 1, 1)`.
What to observe/learn:
- How well an LSTM can learn simple numerical patterns.
- The differences in data preparation and loss function when moving from classification (characters) to regression (numbers).
- The flexibility of RNNs/LSTMs to handle different types of sequential data.
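If you get stuck, here is one possible solution sketch (not the only valid approach). The normalization constant of 100, the hidden size, and the number of epochs are arbitrary choices made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
seq = [10.0, 20.0, 30.0, 40.0, 50.0]
scale = 100.0  # arbitrary normalization constant

# Input-target pairs: (10, 20), (20, 30), (30, 40), (40, 50)
# Each input is a length-1 sequence of a single continuous feature.
X = torch.tensor([[[v / scale]] for v in seq[:-1]])  # shape (4, 1, 1)
y = torch.tensor([[v / scale] for v in seq[1:]])     # shape (4, 1)

class TinyLSTM(nn.Module):
    def __init__(self, hidden_size=16):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        output, _ = self.lstm(x)          # default zero initial states
        return self.fc(output[:, -1, :])  # predict from the last time step

model = TinyLSTM()
criterion = nn.MSELoss()  # regression, not classification
optimizer = optim.Adam(model.parameters(), lr=0.01)

for _ in range(500):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    pred = model(torch.tensor([[[50.0 / scale]]])) * scale  # de-normalize
print(f"Prediction after 50: {pred.item():.1f}")
```

Extrapolating beyond the training range is hard for such a tiny model, so don't be surprised if the prediction after 50 is only roughly near 60.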
Common Pitfalls & Troubleshooting
- Incorrect Tensor Shapes: This is the most common issue when working with RNNs.
  - Symptom: `RuntimeError: expected input to be 3D, got 2D` or a size-mismatch error.
  - Fix: Always double-check the expected input shape for your RNN layer: `(batch_size, sequence_length, input_size)` when `batch_first=True`. Use `unsqueeze()` or `view()` to add/remove dimensions as needed. Remember that `h0` and `c0` also have specific shapes.
- Not Detaching Hidden States (Manual RNNs): If you're manually carrying hidden states across separate training iterations (e.g., processing a very long sequence in chunks), you might forget to `detach()` the hidden state from the computation graph.
  - Symptom: Gradients accumulating indefinitely, leading to memory errors or incorrect updates.
  - Fix: Before carrying a hidden state into the next iteration, call `detach()` on it so it is cut from the old computation graph. Note that passing a hidden state into `nn.RNN` or `nn.LSTM` does not detach it for you; in custom chunked-training loops, detaching is your responsibility.
- Vanishing/Exploding Gradients (Even with LSTMs/GRUs): While LSTMs and GRUs are designed to mitigate these problems, they can still occur, especially with very deep networks or poorly chosen learning rates.
  - Symptom: Loss becoming `NaN` (Not a Number), not decreasing, or extremely slow training.
  - Fix:
    - Gradient Clipping: Clip gradients to a maximum norm. Call `torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=X)` after `loss.backward()` and before `optimizer.step()`.
    - Learning Rate Adjustment: A smaller learning rate can help.
    - Initialization: Proper weight initialization can also play a role.
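Here is a minimal sketch of where gradient clipping slots into a training step. The model and data are throwaway placeholders; the point is only that `clip_grad_norm_` sits between `backward()` and `step()`.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model = nn.LSTM(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(model.parameters()) + list(head.parameters())
optimizer = optim.Adam(params, lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(2, 5, 4)      # (batch_size, seq_len, input_size)
target = torch.randn(2, 5, 1)

optimizer.zero_grad()
output, _ = model(x)          # default zero initial states
loss = criterion(head(output), target)
loss.backward()
# Clip AFTER backward() and BEFORE step(); max_norm=1.0 is a common default
torch.nn.utils.clip_grad_norm_(params, max_norm=1.0)
optimizer.step()
print(f"loss: {loss.item():.4f}")
```

After the call, the total gradient norm across `params` is at most `max_norm`, which keeps a single bad batch from blowing up the weights.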
Summary
Phew! You’ve just unlocked a powerful new class of neural networks! Here’s a quick recap of what we covered:
- The Need for RNNs: Traditional feedforward networks struggle with sequential data due to their lack of memory and fixed input size.
- Basic RNNs: Introduce a “recurrent” connection that allows information (via a hidden state) to flow from one time step to the next, enabling memory and parameter sharing.
- Vanishing/Exploding Gradients: A major limitation of basic RNNs, making it difficult to learn long-term dependencies.
- LSTMs (Long Short-Term Memory): A sophisticated RNN variant with a cell state and three gates (forget, input, output) that effectively solve the vanishing gradient problem, allowing them to learn very long-term dependencies.
- GRUs (Gated Recurrent Units): A simpler, more computationally efficient alternative to LSTMs, using an update gate and a reset gate. Often performs comparably to LSTMs.
- PyTorch Implementation: We walked through how to prepare sequence data, build `nn.RNN` and `nn.LSTM` models, and train them for character-level prediction.
RNNs, LSTMs, and GRUs have been foundational for breakthroughs in areas like natural language processing and time series analysis. While newer architectures like Transformers (which we’ll explore later) have gained prominence, understanding recurrent networks is crucial for any AI/ML engineer, providing a solid foundation for processing sequential information.
In the next chapter, we’ll expand our deep learning toolkit even further by delving into Convolutional Neural Networks (CNNs), which are the cornerstone of modern computer vision!
References
- PyTorch Documentation - `torch.nn.RNN`: https://pytorch.org/docs/stable/generated/torch.nn.RNN.html
- PyTorch Documentation - `torch.nn.LSTM`: https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
- PyTorch Documentation - `torch.nn.GRU`: https://pytorch.org/docs/stable/generated/torch.nn.GRU.html
- Colah's Blog - Understanding LSTMs (Highly Recommended Read): https://colah.github.io/posts/2015-08-Understanding-LSTMs/
- PyTorch Installation Guide: https://pytorch.org/get-started/locally/
This page is AI-assisted and reviewed. It references official documentation and recognized resources where relevant.