Introduction

Welcome to Chapter 21! After exploring the theoretical foundations of deep learning, neural networks, and various architectures, it’s time to get your hands dirty with a complete, practical project. In this chapter, we’ll build a custom image classifier from scratch, leveraging the power of modern deep learning frameworks and techniques.

This project will guide you through the entire lifecycle of an image classification task: from preparing your own dataset, to selecting and modifying a pre-trained model, training it, and evaluating its performance. By the end, you’ll not only have a working image classifier but also a much deeper understanding of the practical considerations involved in real-world deep learning applications. This is a foundational skill for any aspiring AI/ML engineer or researcher, opening doors to advanced computer vision tasks.

Before we dive in, ensure you’re comfortable with Python programming, basic machine learning concepts, and the fundamentals of deep learning, including neural networks and convolutional layers, as covered in previous chapters. We’ll be using PyTorch, one of the leading deep learning frameworks, so a basic familiarity with its tensors and operations will be beneficial, though we’ll explain each step.

Understanding Image Classification

At its core, image classification is the task of assigning a label or category to an entire image. For example, given an image, a classifier might tell us if it contains a “cat,” a “dog,” or an “airplane.” This seemingly simple task is a cornerstone of many advanced AI applications, from self-driving cars recognizing pedestrians to medical imaging systems detecting diseases.

How do machines “see” and classify images? Unlike humans, who perceive objects holistically, computers process images as grids of pixel values. Deep learning, particularly with Convolutional Neural Networks (CNNs), provides a powerful way for machines to learn hierarchical features from these pixel values. CNNs can automatically detect edges, textures, shapes, and eventually entire objects, forming increasingly complex representations as data passes through their layers.
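To make this concrete, here is a minimal sketch (using PyTorch, the framework used throughout this chapter) of how a convolutional layer turns a grid of pixel values into feature maps. The tensor here is random noise standing in for a real image:

```python
import torch
import torch.nn as nn

# To a network, an RGB image is just a 3 x height x width grid of numbers.
# We use random values here as a stand-in for real pixel data.
image = torch.rand(1, 3, 224, 224)  # (batch, channels, height, width)

# A convolutional layer slides small learned filters (here 3x3) over the
# grid; each filter responds to a local pattern such as an edge or texture.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
feature_maps = conv(image)

print(feature_maps.shape)  # 16 feature maps, spatial size preserved by padding=1
```

Stacking many such layers is what lets a CNN build up from edges to textures to whole objects.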

The Problem: Limited Data and Training Time

Training a powerful CNN from scratch requires massive datasets and significant computational resources. What if you only have a few hundred or thousand images for your specific classification task? This is where a technique called Transfer Learning becomes incredibly valuable.

The Power of Transfer Learning

Imagine you’ve spent years learning to identify various animals. Now, if someone asks you to identify a new breed of dog you’ve never seen before, you don’t start from scratch. Instead, you leverage your existing knowledge of what makes a “dog” a “dog” (ears, snout, fur, etc.) and adapt it to the new breed. Transfer learning in deep learning works similarly.

Transfer learning is a technique where a model trained on a large, general dataset (like ImageNet, which contains millions of images across 1000 categories) is repurposed for a new, often smaller, related task. The idea is that the features learned by the model on the large dataset (e.g., detecting edges, corners, textures, and even parts of objects) are generic and useful for many computer vision tasks.

There are two primary ways to apply transfer learning:

  1. Feature Extractor: You take a pre-trained CNN, remove its final classification layer, and use the rest of the network as a fixed feature extractor. The features extracted are then fed into a new, smaller classifier (e.g., a simple fully connected layer) that you train from scratch on your specific dataset. This is efficient when your dataset is small and very different from the pre-training dataset.
  2. Fine-tuning: You take a pre-trained CNN and replace its final classification layer, just like with a feature extractor. However, instead of freezing all the pre-trained layers, you unfreeze some or all of them and continue training the entire network (or parts of it) on your new dataset, usually with a very low learning rate. This allows the model to adapt its learned features more closely to your specific data, often leading to better performance, especially if your dataset is larger and similar to the pre-training data.

For this project, we’ll use a common and highly effective approach: loading a pre-trained model, replacing its final classification layer, and fine-tuning the network on our own data.

Choosing Our Tools: PyTorch

For this project, we’ll be using PyTorch, a powerful open-source machine learning framework developed by Facebook (now Meta). PyTorch is known for its flexibility, Python-friendly interface, and dynamic computational graph, which makes debugging and experimentation intuitive.

We’ll target the PyTorch 2.x line, which builds on earlier releases with continued improvements in performance, compiler optimizations, and distributed training; check the official PyTorch website for the current stable release. Its torchvision library provides convenient access to popular datasets, model architectures, and image transformations, making it ideal for computer vision tasks.

Dataset Considerations: Custom Data

For any image classification project, your data is paramount. Here we’ll simulate a “custom” dataset, organized in the standard layout that torchvision.datasets.ImageFolder understands:

your_custom_dataset/
├── class_a/
│   ├── image1.jpg
│   ├── image2.png
│   └── ...
├── class_b/
│   ├── imageA.jpeg
│   ├── imageB.jpg
│   └── ...
└── ...

Each subfolder name (class_a, class_b) will automatically become a class label.

Step-by-Step Implementation

Let’s get started! We’ll build our custom image classifier piece by piece.

Step 1: Setting Up Your Environment

First, open your terminal or command prompt. We need to install PyTorch and torchvision. For optimal performance, especially with deep learning, a GPU (Graphics Processing Unit) is highly recommended. If you have an NVIDIA GPU, ensure you have CUDA installed.

# For NVIDIA GPU users (example: CUDA 12.1 builds for the PyTorch 2.x line).
# Check PyTorch's official website for the exact command if your CUDA version differs.
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121

# For CPU-only users (if you don't have a compatible GPU or prefer CPU)
# This will install the CPU version of PyTorch.
# pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cpu

Explanation:

  • pip install: This is the standard Python package installer.
  • torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0: We’re pinning exact versions to ensure consistency and reproducibility; these three packages are released together, so torch 2.4.0 pairs with torchvision 0.19.0 and torchaudio 2.4.0. Always refer to the official PyTorch installation guide for the most up-to-date and precise commands for your specific system and CUDA version.
  • --index-url https://download.pytorch.org/whl/cu121: This flag tells pip to download the packages from a specific index, in this case, the one containing CUDA 12.1 compatible binaries. If you’re on CPU, you’d use /whl/cpu.

Next, let’s create a Python script named image_classifier.py.

Step 2: Preparing Our Custom Dataset

For this project, we’ll create a dummy dataset structure. In a real scenario, you would replace these with your actual image files.

Action: Create a directory structure like this in the same folder as your image_classifier.py:

data/
├── train/
│   ├── cat/
│   │   ├── cat_001.jpg
│   │   ├── cat_002.jpg
│   │   └── ... (add a few more dummy images, even placeholders)
│   └── dog/
│       ├── dog_001.jpg
│       ├── dog_002.jpg
│       └── ... (add a few more dummy images)
└── val/
    ├── cat/
    │   ├── cat_003.jpg
    │   └── ...
    └── dog/
        ├── dog_003.jpg
        └── ...

You can use any small image files you have for this exercise, but they must be valid images: empty placeholder files will fail when PyTorch tries to decode them. Make sure there are at least 2-3 images per class in both the train and val directories.
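Since the loader needs decodable image files, one quick way to generate valid placeholders is with Pillow (which torchvision already depends on). The counts and file names below are illustrative; only the folder layout needs to match:

```python
import os
from PIL import Image

classes = ["cat", "dog"]
counts = {"train": 4, "val": 2}  # a few images per class and split

for split, n in counts.items():
    for cls in classes:
        folder = os.path.join("data", split, cls)
        os.makedirs(folder, exist_ok=True)
        for i in range(n):
            # A tiny solid-colour JPEG is enough for the pipeline to run;
            # a zero-byte file would fail to decode.
            img = Image.new("RGB", (64, 64), color=(40 * i % 256, 120, 180))
            img.save(os.path.join(folder, f"{cls}_{i:03d}.jpg"))

print("Placeholder images written under data/")
```

With real projects you would, of course, replace these with actual photographs of each class.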

Now, let’s add the code to load and preprocess this data.

# image_classifier.py

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, models, transforms
from torch.utils.data import DataLoader
import os
import time
import copy

# 1. Define device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# 2. Define data transformations
# These transformations ensure images are consistent (resized, converted to tensor)
# and normalized according to ImageNet's mean and standard deviation,
# which is crucial for pre-trained models.
data_transforms = {
    'train': transforms.Compose([
        transforms.RandomResizedCrop(224), # Randomly crop and resize to 224x224
        transforms.RandomHorizontalFlip(), # Randomly flip the image horizontally
        transforms.ToTensor(),             # Convert image to PyTorch Tensor
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) # Normalize ImageNet stats
    ]),
    'val': transforms.Compose([
        transforms.Resize(256),             # Resize the image to 256x256
        transforms.CenterCrop(224),         # Crop the center to 224x224
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
    ]),
}

# 3. Load datasets
data_dir = 'data' # Make sure your 'data' folder is in the same directory as this script
image_datasets = {x: datasets.ImageFolder(os.path.join(data_dir, x),
                                          data_transforms[x])
                  for x in ['train', 'val']}

# 4. Create data loaders
# DataLoader provides an iterable over the dataset, handling batching, shuffling, etc.
dataloaders = {x: DataLoader(image_datasets[x], batch_size=4,
                             shuffle=True, num_workers=2) # on Windows, use num_workers=0 or guard the script with `if __name__ == "__main__":`
               for x in ['train', 'val']}
dataset_sizes = {x: len(image_datasets[x]) for x in ['train', 'val']}
class_names = image_datasets['train'].classes

print(f"Detected classes: {class_names}")
print(f"Training dataset size: {dataset_sizes['train']}")
print(f"Validation dataset size: {dataset_sizes['val']}")

Explanation:

  1. device: This line checks if a CUDA-enabled GPU is available. If so, it uses cuda:0 (the first GPU); otherwise, it defaults to cpu. Moving computations to the GPU significantly speeds up training.
  2. data_transforms: This dictionary defines how our images will be preprocessed.
    • transforms.Compose: Chains multiple transformations together.
    • transforms.RandomResizedCrop(224) / transforms.Resize(256) & transforms.CenterCrop(224): Ensures all images are resized to a consistent 224x224 pixels, which is the input size expected by many pre-trained models. Random operations are good for training to add variability.
    • transforms.RandomHorizontalFlip(): A common data augmentation technique that randomly flips images horizontally, helping the model generalize better.
    • transforms.ToTensor(): Converts the image from a PIL Image or NumPy array to a PyTorch Tensor. It also scales pixel values to the range [0.0, 1.0].
    • transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]): Normalizes the image tensors using the mean and standard deviation of the ImageNet dataset. This is crucial because our pre-trained model was trained on ImageNet with this normalization.
  3. image_datasets: datasets.ImageFolder is a super convenient utility from torchvision that automatically loads images from a directory structure where subfolders represent classes. It applies the defined transformations.
  4. dataloaders: DataLoader wraps the ImageFolder dataset, providing an efficient way to iterate over batches of images during training. batch_size determines how many images are processed at once, shuffle=True shuffles the data for each epoch (important for training), and num_workers specifies how many subprocesses to use for data loading (speeds up I/O).
  5. dataset_sizes and class_names: We extract the total number of images in each split and the names of the detected classes.
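A quick way to sanity-check the pipeline is to pull one batch and inspect its shapes. The sketch below uses a random stand-in dataset so it runs without any image files; with the real `dataloaders['train']` from above, the shapes are the same:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the transformed image dataset: 8 random "images" with labels 0/1.
fake_images = torch.rand(8, 3, 224, 224)
fake_labels = torch.randint(0, 2, (8,))
loader = DataLoader(TensorDataset(fake_images, fake_labels),
                    batch_size=4, shuffle=True)

inputs, labels = next(iter(loader))
print(inputs.shape)  # torch.Size([4, 3, 224, 224]): a batch of 4 normalized images
print(labels.shape)  # torch.Size([4]): one integer class index per image
```

If your real batch shapes differ from `(batch_size, 3, 224, 224)`, revisit the transforms in the previous step.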

Step 3: Loading a Pre-trained Model

Now, let’s load a pre-trained model and modify its final layer for our specific classification task. We’ll use resnet18, a relatively small but effective Convolutional Neural Network.

# image_classifier.py (continue appending to the file)

# 5. Load a pre-trained model
model_ft = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Get the number of input features for the last fully connected layer
num_ftrs = model_ft.fc.in_features

# Replace the last layer with a new one that has 'len(class_names)' output features
# This new layer will be trained from scratch for our specific classes.
model_ft.fc = nn.Linear(num_ftrs, len(class_names))

# Move the model to the chosen device (GPU or CPU)
model_ft = model_ft.to(device)

print(f"Model architecture modified. Final classification layer now has {len(class_names)} outputs.")

Explanation:

  1. models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1): This line loads the resnet18 architecture. Crucially, weights=models.ResNet18_Weights.IMAGENET1K_V1 tells PyTorch to download and load the weights pre-trained on the ImageNet-1K dataset. This is the core of transfer learning!
  2. num_ftrs = model_ft.fc.in_features: We inspect the last layer of the ResNet model (model_ft.fc). This layer is typically a fully connected (linear) layer that outputs 1000 classes (for ImageNet). We get the number of input features this layer expects.
  3. model_ft.fc = nn.Linear(num_ftrs, len(class_names)): We replace the original final fully connected layer with a new one. The input features remain the same (num_ftrs), but the output features are now len(class_names), which is the number of classes in our custom dataset (e.g., 2 for “cat” and “dog”). This new layer’s weights will be randomly initialized and will be the primary focus of our initial training.
  4. model_ft = model_ft.to(device): We move the entire model to the specified device (GPU if available, otherwise CPU).

Step 4: Defining Loss Function and Optimizer

For training, we need a way to measure how “wrong” our model’s predictions are (loss function) and a strategy to adjust the model’s weights to reduce that error (optimizer).

# image_classifier.py (continue appending to the file)

# 6. Define Loss Function and Optimizer
criterion = nn.CrossEntropyLoss() # Suitable for multi-class classification

# Note that all parameters are being optimized; we haven't frozen any layers,
# so this is full fine-tuning. The newly initialized `model_ft.fc` layer starts
# from random weights, so most of the early adaptation happens there.
optimizer_ft = optim.Adam(model_ft.parameters(), lr=0.001)

# Optionally, you can set up a learning rate scheduler to reduce the learning rate
# as training progresses, which can help achieve better convergence.
# Here, we reduce the learning rate by a factor of 0.1 every 7 epochs.
exp_lr_scheduler = optim.lr_scheduler.StepLR(optimizer_ft, step_size=7, gamma=0.1)

print("Loss function (CrossEntropyLoss) and Optimizer (Adam) configured.")

Explanation:

  1. criterion = nn.CrossEntropyLoss(): CrossEntropyLoss is a common and effective loss function for multi-class classification problems. It combines LogSoftmax and NLLLoss in one single class, which is numerically stable.
  2. optimizer_ft = optim.Adam(model_ft.parameters(), lr=0.001): We use the Adam optimizer, a popular choice known for its efficiency and good performance. model_ft.parameters() tells the optimizer which parameters (weights and biases) in our model it should update. lr=0.001 sets the initial learning rate.
  3. exp_lr_scheduler = optim.lr_scheduler.StepLR(...): A learning rate scheduler dynamically adjusts the learning rate during training. StepLR decreases the learning rate by a factor of gamma (here, 0.1) every step_size (here, 7) epochs. This often helps the model to converge more effectively by taking larger steps initially and smaller, more precise steps later.
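To see the scheduler’s effect in isolation, the sketch below steps a StepLR through 15 “epochs” with a single dummy parameter and prints the learning rate as it decays:

```python
import torch
import torch.optim as optim

# A single dummy parameter stands in for a full model.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = optim.Adam([param], lr=0.001)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)

for epoch in range(15):
    optimizer.step()   # normally: one full pass over the training data
    scheduler.step()   # then advance the schedule
    print(f"epoch {epoch}: lr = {scheduler.get_last_lr()[0]:.6f}")
# The learning rate drops by 10x every 7 scheduler steps
# (first visible in the printout at epoch 6, then again at epoch 13).
```

Keeping `scheduler.step()` once per epoch, after the training phase, matches how the training loop in the next step uses it.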

Step 5: Training the Model

This is the core of the learning process. We’ll define a training function that iterates over our data, makes predictions, calculates loss, and updates the model’s weights.

# image_classifier.py (continue appending to the file)

# 7. Training function
def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()

    best_model_wts = copy.deepcopy(model.state_dict())
    best_acc = 0.0

    for epoch in range(num_epochs):
        print(f'Epoch {epoch}/{num_epochs - 1}')
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0
            running_corrects = 0

            # Iterate over data.
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)

                # Zero the parameter gradients
                optimizer.zero_grad()

                # Forward pass
                # Track gradients only in train phase
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    _, preds = torch.max(outputs, 1) # Get the predicted class
                    loss = criterion(outputs, labels)

                    # Backward pass + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()      # Compute gradients
                        optimizer.step()     # Update model parameters

                # Statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            if phase == 'train':
                scheduler.step() # Update learning rate scheduler

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f'{phase} Loss: {epoch_loss:.4f} Acc: {epoch_acc:.4f}')

            # Deep copy the model if it's the best validation accuracy so far
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_wts = copy.deepcopy(model.state_dict())

        print()

    time_elapsed = time.time() - since
    print(f'Training complete in {time_elapsed // 60:.0f}m {time_elapsed % 60:.0f}s')
    print(f'Best val Acc: {best_acc:.4f}')

    # Load best model weights
    model.load_state_dict(best_model_wts)
    return model

# Start training!
print("Starting model training...")
model_ft = train_model(model_ft, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=10) # Reduced epochs for quick demo
print("Training finished.")

Explanation of train_model function:

  1. Initialization: Records start time, initializes best_model_wts (to save the model with the highest validation accuracy) and best_acc.
  2. Epoch Loop: The outer loop iterates for num_epochs.
  3. Phase Loop (train vs. val): Inside each epoch, there are two phases: 'train' and 'val' (validation).
    • model.train() / model.eval(): Sets the model to training or evaluation mode. This is important because layers like Dropout or BatchNorm behave differently during training and inference.
    • running_loss, running_corrects: Variables to accumulate loss and correct predictions for the current phase.
  4. Data Iteration: The inner loop iterates over batches of data from the dataloaders.
    • inputs = inputs.to(device), labels = labels.to(device): Moves the input images and their corresponding labels to the specified device (GPU or CPU).
    • optimizer.zero_grad(): Clears the gradients of all optimized parameters. Gradients accumulate by default, so you need to zero them out before each backward pass.
    • with torch.set_grad_enabled(phase == 'train'): This context manager enables gradient calculation only if we are in the training phase. For validation, gradients are not needed, saving memory and computation.
    • outputs = model(inputs): Performs the forward pass, feeding inputs through the model to get raw predictions (logits).
    • _, preds = torch.max(outputs, 1): torch.max returns the maximum value and its index. We’re interested in the index, which represents the predicted class.
    • loss = criterion(outputs, labels): Calculates the loss between the model’s outputs and the true labels.
    • loss.backward(): Performs the backward pass, computing gradients for all parameters that require gradients.
    • optimizer.step(): Updates the model’s parameters using the calculated gradients and the optimizer’s algorithm.
  5. scheduler.step(): After each training phase, the learning rate scheduler is updated, potentially decreasing the learning rate.
  6. Statistics: Calculates and prints the average loss and accuracy for the current phase.
  7. Best Model Check: If the current validation accuracy is better than best_acc so far, the model’s current state (weights) is saved as best_model_wts.
  8. Finalization: After all epochs, the model is loaded with the best_model_wts (the one that performed best on validation), and training time is printed.

Step 6: Saving the Model

Once trained, you’ll want to save your model so you can use it later without retraining.

# image_classifier.py (continue appending to the file)

# 8. Save the trained model
model_save_path = 'custom_image_classifier.pth'
torch.save(model_ft.state_dict(), model_save_path)
print(f"Model saved to {model_save_path}")

# To load the model later:
# loaded_model = models.resnet18(weights=None) # Load architecture without pre-trained weights
# num_ftrs_loaded = loaded_model.fc.in_features
# loaded_model.fc = nn.Linear(num_ftrs_loaded, len(class_names))
# loaded_model.load_state_dict(torch.load(model_save_path, map_location=device))
# loaded_model = loaded_model.to(device)
# loaded_model.eval() # Set to evaluation mode for inference
# print("Model loaded successfully for inference.")

Explanation:

  1. torch.save(model_ft.state_dict(), model_save_path): This line saves only the learned parameters (weights and biases) of the model, not the entire model architecture. This is generally preferred as it keeps the file size small and makes it flexible to load into different environments. The file is saved as custom_image_classifier.pth.
  2. Loading Snippet: The commented-out section shows how you would load these saved weights later. You first need to instantiate the model architecture (e.g., resnet18) and then load the state dictionary into it. Remember to call model.eval() after loading for inference.

Congratulations! You’ve just built and trained a custom image classifier using a pre-trained ResNet model and PyTorch.

Mini-Challenge: Experiment with Hyperparameters and Models

Now that you have a working pipeline, it’s time to experiment and deepen your understanding.

Challenge:

  1. Change the Pre-trained Model: Instead of resnet18, try using resnet34 or vgg16 from torchvision.models. You’ll need to adjust the num_ftrs and the final nn.Linear layer accordingly for the new model’s architecture.
  2. Adjust Learning Rate & Epochs: Change the lr in the Adam optimizer (e.g., to 0.01 or 0.0001). Also, increase num_epochs in the train_model function (e.g., to 20 or 30).
  3. Observe the Effects: How do these changes impact the training speed, final accuracy, and validation loss?

Hint:

  • For resnet34, the modification to the fc layer is similar to resnet18.
  • For vgg16, the final classifier is often within model.classifier and might involve multiple nn.Linear layers. You’ll need to replace the last nn.Linear layer in model.classifier with a new one. For example, model_ft.classifier[6] = nn.Linear(model_ft.classifier[6].in_features, len(class_names)). Always inspect the model’s structure by printing print(model_ft) after loading.

What to observe/learn: This challenge will help you understand that deep learning is often an iterative process of experimentation. Different architectures and hyperparameters can significantly affect model performance and training dynamics. You’ll gain intuition about how to debug model behavior by looking at training and validation loss/accuracy curves.

Common Pitfalls & Troubleshooting

Even experienced practitioners encounter issues. Here are a few common ones you might face:

  1. CUDA out of memory: This error occurs when your GPU doesn’t have enough memory to process the current batch of data or the model itself.
    • Solution: Reduce batch_size (e.g., from 4 to 2 or even 1). If that’s not enough, try using a smaller model (e.g., resnet18 instead of resnet50) or use CPU training (slower).
  2. Incorrect Data Paths or Empty Classes: If ImageFolder can’t find your images or if a class folder is empty, you might get errors or unexpected dataset_sizes.
    • Solution: Double-check your data_dir and the exact spelling of subfolder names. Ensure there are actual image files in each class subfolder.
  3. Overfitting: Your training accuracy is very high, but validation accuracy is much lower. The model has memorized the training data but doesn’t generalize well to unseen data.
    • Solution:
      • Increase data augmentation (transforms).
      • Reduce model complexity (e.g., use resnet18 if you were using resnet50).
      • Add regularization techniques (e.g., Dropout layers, weight decay in optimizer).
      • Reduce num_epochs.
      • Get more diverse training data.
  4. Underfitting: Both training and validation accuracy are low. The model isn’t learning enough from the data.
    • Solution:
      • Increase model complexity (e.g., use a deeper ResNet).
      • Increase num_epochs.
      • Increase the learning rate (carefully).
      • Ensure your data is clean and properly labeled.
      • Check if the pre-trained weights are actually being loaded.

Summary

In this chapter, you’ve taken a significant step from theory to practice by building a custom image classifier. Here are the key takeaways:

  • Image Classification Fundamentals: You understand the goal of image classification and how CNNs are leveraged for this task.
  • Transfer Learning: You’ve practically applied transfer learning using a pre-trained resnet18 model, significantly reducing the need for massive datasets and long training times.
  • PyTorch Workflow: You’ve gained hands-on experience with a complete PyTorch workflow, including:
    • Setting up your environment and handling devices (CPU/GPU).
    • Preparing custom datasets with ImageFolder and transforms.
    • Creating efficient data pipelines with DataLoader.
    • Loading and modifying pre-trained models.
    • Defining loss functions (nn.CrossEntropyLoss) and optimizers (optim.Adam).
    • Implementing a full training and validation loop with learning rate scheduling.
    • Saving and loading model weights.
  • Practical Problem Solving: You’ve encountered and considered solutions for common deep learning challenges like CUDA out of memory and overfitting/underfitting.

This project is a fundamental building block. From here, you can explore more advanced computer vision tasks, delve into deploying your models, or even experiment with different architectures and fine-tuning strategies. The journey into becoming a proficient AI/ML engineer is paved with such hands-on experiences!

References

  1. PyTorch Official Website: The primary resource for PyTorch documentation, tutorials, and installation guides. Always refer here for the latest stable versions and best practices.
  2. PyTorch torchvision Documentation: Detailed information on datasets, models, and transforms available in the torchvision library.
  3. PyTorch Transfer Learning Tutorial: An excellent official tutorial covering transfer learning in PyTorch, from which this chapter draws its structure and concepts.
  4. ImageNet: The large-scale visual database that many pre-trained models are trained on, providing a foundation for transfer learning.
  5. Adam Optimizer Paper: Kingma and Ba’s “Adam: A Method for Stochastic Optimization,” the original paper introducing the optimizer used in this chapter, provides deep insights into its mechanics.
