Introduction

Welcome to an exciting hands-on chapter where we’ll dive deep into the practical art of fine-tuning Large Language Models (LLMs)! You’ve learned about the power of these models, their architectures, and how they process language. Now, it’s time to make them truly yours by adapting them to perform a specific task that their general pre-training might not have fully covered.

In this chapter, you will learn how to take a pre-trained LLM and, with relatively small computational resources, specialize it for a new, targeted purpose. We’ll focus on Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly LoRA (Low-Rank Adaptation), which has revolutionized how we can adapt massive models without needing supercomputers. By the end of this project, you’ll have fine-tuned an LLM and tested its specialized capabilities, gaining invaluable experience in a crucial skill for modern AI engineers.

This project builds upon your understanding of deep learning, neural network training workflows, and model evaluation from previous chapters. Familiarity with Python, PyTorch, and the basics of the Hugging Face transformers library will be beneficial, but we’ll guide you through every step. Let’s get started and make an LLM smarter for your needs!

Core Concepts

Before we jump into the code, let’s establish a solid understanding of the core concepts that make LLM fine-tuning both possible and efficient.

What is Fine-Tuning?

Imagine you’ve taught a brilliant student (our pre-trained LLM) everything about the world – history, science, literature, art. Now, you need them to become an expert in a very niche field, like “identifying positive sentiment in customer reviews for a specific product line.” While the student has general knowledge, they need specialized training to excel at this particular task.

Fine-tuning is precisely that specialized training. We take a model that has already learned a vast amount of general knowledge from a massive dataset (its pre-training) and then train it further on a smaller, task-specific dataset. This process allows the model to adapt its existing knowledge to the nuances of the new task, often achieving impressive performance with much less data and computation than training from scratch.

Why not just train a small model from scratch for the specific task? Because LLMs, even after fine-tuning, retain much of their general understanding of language, grammar, and reasoning, which provides a powerful foundation that a small, task-specific model could never build on its own.

The Challenge of Full Fine-Tuning

LLMs are massive. Models like Llama 2 70B have tens of billions of parameters, and frontier models such as GPT-4 are believed to be even larger. If we were to fine-tune all of these parameters on a new dataset, it would require:

  1. Enormous GPU Memory: Loading the entire model and its optimizers can easily consume hundreds of gigabytes of VRAM.
  2. Significant Computational Power: Updating billions of parameters for many iterations is computationally expensive and slow.
  3. Risk of Catastrophic Forgetting: Over-training on a small, specific dataset can sometimes make the model “forget” its general knowledge, degrading its performance on broader tasks.
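To make the scale of point 1 concrete, here is a back-of-the-envelope estimate (a sketch with illustrative assumptions, not exact figures) of the memory needed just to hold a 7B-parameter model, its gradients, and its AdamW optimizer states:

```python
# Back-of-the-envelope VRAM estimate for FULL fine-tuning of a 7B-parameter
# model with AdamW. Illustrative assumptions: fp16 weights and gradients
# (2 bytes each) and fp32 Adam states (momentum + variance, 4 + 4 bytes).
params = 7_000_000_000

bytes_weights = params * 2          # fp16 model weights
bytes_grads = params * 2            # fp16 gradients
bytes_optimizer = params * (4 + 4)  # fp32 Adam momentum and variance

total_gb = (bytes_weights + bytes_grads + bytes_optimizer) / 1024**3
print(f"~{total_gb:.0f} GB before counting activations")
```

Roughly 78 GB before a single activation is stored, which is why fully fine-tuning even a "small" 7B model exceeds most single-GPU setups.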

These challenges make full fine-tuning impractical for most individual developers or smaller teams. This is where Parameter-Efficient Fine-Tuning (PEFT) comes to the rescue!

Parameter-Efficient Fine-Tuning (PEFT)

PEFT techniques are designed to address the challenges of full fine-tuning by only updating a small fraction of the model’s parameters, or by introducing a few new, small parameters, while keeping the majority of the pre-trained weights frozen. This drastically reduces computational cost, memory footprint, and the risk of catastrophic forgetting.

Think of it like this: instead of rewriting the entire textbook for our brilliant student, we just add a few specialized notes or a small supplementary chapter that focuses on the niche topic. The student still uses their core knowledge but now has specific guidance for the new task.

There are several PEFT methods, but one has emerged as a clear leader due to its simplicity, effectiveness, and widespread adoption: LoRA.

LoRA (Low-Rank Adaptation)

LoRA, or Low-Rank Adaptation, is a technique that inserts small, trainable matrices alongside the frozen linear layers of a transformer (most commonly the attention projections, though MLP layers are often included too). Instead of directly fine-tuning the large weight matrices (W) of the pre-trained model, LoRA introduces two much smaller matrices, A and B, whose product forms a low-rank update to W.

Here’s a conceptual diagram of how LoRA works:

graph TD
    A[Input Features] --> B[Pre-trained Weight Matrix W]
    subgraph LoRA Adapter
        A --> E[Matrix A]
        E --> F[Matrix B]
        F --> G[LoRA Output ΔW = A @ B]
    end
    B --> H[Add LoRA Output]
    G --> H
    H --> C[Output]

Explanation:

  • W is the large, pre-trained weight matrix (e.g., for query, key, or value projections in an attention layer). It remains frozen.
  • LoRA introduces two much smaller matrices, A and B. The input features are multiplied by A, then the result by B.
  • The output of this A @ B multiplication (ΔW) is then added to the output of the original W matrix.
  • Crucially, only A and B are trained. Because A and B connect through a small inner dimension (the rank r), which is far smaller than the original matrix dimensions, the total number of trainable parameters is drastically reduced.
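The parameter savings follow directly from the shapes involved. This short sketch (using a 4096×4096 projection as an illustrative size) counts the trainable parameters with and without LoRA:

```python
# Trainable parameters for one weight matrix: full fine-tuning vs. LoRA.
# For a frozen d_out x d_in matrix W, LoRA trains A (r x d_in) and B (d_out x r),
# so the trainable count drops from d_out * d_in to r * (d_in + d_out).
d_in, d_out, r = 4096, 4096, 16   # illustrative attention-projection size, rank 16

full_params = d_in * d_out        # updated by full fine-tuning
lora_params = r * (d_in + d_out)  # updated by LoRA (A and B together)

print(full_params, lora_params)
print(f"LoRA trains {lora_params / full_params:.2%} of the original parameters")
```

For this single matrix, LoRA trains under 1% of what full fine-tuning would update, and that ratio repeats at every layer LoRA is applied to.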

Why LoRA is Powerful:

  • Memory Efficiency: Frozen weights need no gradients or optimizer states, which is where most training memory goes.
  • Computational Efficiency: Fewer parameters to update means faster training.
  • Performance: Often achieves performance comparable to full fine-tuning.
  • Modular Adapters: You can train multiple LoRA adapters for different tasks and swap them in and out, or even combine them, without modifying the base model. This is incredibly flexible!

Supervised Fine-Tuning (SFT) Datasets

For fine-tuning, we typically use a technique called Supervised Fine-Tuning (SFT). This involves providing the model with examples of inputs and their desired outputs. For LLMs, this often takes the form of instruction-response pairs, like:

"Instruction: Summarize the following text: [TEXT]\nResponse: [SUMMARY]"

or

"Instruction: What is the capital of France?\nResponse: Paris"

The quality and format of your SFT dataset are paramount. A good dataset is:

  • Relevant: Directly pertains to the task you want the LLM to perform.
  • Diverse: Covers a wide range of examples within your task domain.
  • High-Quality: Free from errors, inconsistencies, and biases.
  • Formatted Correctly: Structured in a way that the model can easily learn from (e.g., consistent instruction/response templates).
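Consistent formatting matters enough that it is worth encoding the template once and reusing it for both training and inference. The sketch below uses the instruction/response template this chapter adopts; the helper function names are our own, for illustration:

```python
# The instruction template used throughout this chapter, plus the inverse
# step used at inference time to strip the prompt back off.
TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_example(instruction: str, response: str) -> str:
    # Produce one training example in the chapter's template.
    return TEMPLATE.format(instruction=instruction, response=response)

def extract_response(generated_text: str) -> str:
    # Everything after the response marker is treated as the model's answer.
    return generated_text.split("### Response:\n", 1)[1].strip()

text = format_example("What is the capital of France?", "Paris")
print(extract_response(text))  # Paris
```

Keeping both directions in one place guards against the most common SFT bug: a prompt at inference time that does not byte-for-byte match the training template.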

Modern Tooling: Hugging Face Ecosystem (2026-01-17)

The Hugging Face ecosystem continues to be the de facto standard for working with LLMs. We’ll be using several key libraries:

  • transformers (version ~=4.37.0): Provides pre-trained models, tokenizers, and a unified API for various architectures.
  • peft (version ~=0.8.0): The Parameter-Efficient Fine-Tuning library, offering implementations of LoRA and other PEFT methods.
  • trl (version ~=0.7.10): The Transformer Reinforcement Learning library, which includes SFTTrainer for easy supervised fine-tuning.
  • datasets (version ~=2.16.1): For efficient loading, processing, and managing datasets.
  • bitsandbytes (version ~=0.42.0): Enables efficient 4-bit and 8-bit quantization, allowing you to load and fine-tune massive models on consumer GPUs.
  • accelerate (version ~=0.26.1): Simplifies distributed training and mixed-precision training.

These versions are stable and widely used as of early 2026. Always refer to the Hugging Face documentation for the absolute latest updates.

Step-by-Step Implementation: Fine-Tuning an LLM for Instruction Following

Our goal for this project is to fine-tune a small LLM (e.g., Mistral 7B) to follow instructions more precisely on a custom dataset. We’ll simulate a simple instruction-following task.

Environment Setup

First, let’s set up your environment. You’ll need Python 3.10 or newer. A GPU (NVIDIA preferred) with at least 12GB of VRAM is highly recommended for Mistral 7B with 4-bit quantization, though 8GB might work for smaller models or more aggressive quantization.

Open your terminal or command prompt and run the following commands:

# Create a new virtual environment (highly recommended!)
python -m venv llm_finetune_env
source llm_finetune_env/bin/activate  # On Windows: .\llm_finetune_env\Scripts\activate

# Install PyTorch (ensure you get the CUDA version if you have an NVIDIA GPU)
# Check https://pytorch.org/get-started/locally/ for the exact command for your CUDA version
# Example for CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install Hugging Face libraries and bitsandbytes
pip install transformers~=4.37.0 peft~=0.8.0 trl~=0.7.10 datasets~=2.16.1 bitsandbytes~=0.42.0 accelerate~=0.26.1

Note: The ~= operator in pip install means “compatible release.” This ensures you get a version that’s close to the specified one, avoiding breaking changes while still getting updates.

Step 1: Data Preparation

We’ll create a synthetic dataset for demonstration purposes. In a real-world scenario, you would curate this from actual data. Our dataset will consist of simple instruction-response pairs.

Create a new Python file, e.g., finetune_llm.py.

# finetune_llm.py

import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import os

# --- 1. Data Preparation ---
print("Step 1: Preparing dataset...")

# Define our synthetic dataset
# Each entry is a dictionary with 'instruction' and 'response'
# We'll format this into a single 'text' field for the SFTTrainer
data = [
    {"instruction": "What is the capital of Canada?", "response": "The capital of Canada is Ottawa."},
    {"instruction": "Name two types of big cats.", "response": "Two types of big cats are lions and tigers."},
    {"instruction": "How do you say 'hello' in Spanish?", "response": "You say 'hola' in Spanish."},
    {"instruction": "Explain the concept of photosynthesis briefly.", "response": "Photosynthesis is the process by which green plants convert light energy into chemical energy, producing oxygen as a byproduct."},
    {"instruction": "Who painted the Mona Lisa?", "response": "The Mona Lisa was painted by Leonardo da Vinci."},
    {"instruction": "What is 2 + 2?", "response": "2 + 2 equals 4."},
    {"instruction": "Tell me a fun fact about space.", "response": "A full NASA space suit costs about $12 million."},
    {"instruction": "What is the largest ocean on Earth?", "response": "The Pacific Ocean is the largest ocean on Earth."},
    {"instruction": "Define 'algorithm'.", "response": "An algorithm is a set of step-by-step instructions or rules designed to solve a problem or perform a task."},
    {"instruction": "What is the main ingredient in guacamole?", "response": "The main ingredient in guacamole is avocado."}
]

# Convert to Hugging Face Dataset format
# SFTTrainer expects a 'text' column, so we'll create that
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"}

dataset = Dataset.from_list(data)
dataset = dataset.map(format_instruction)

print(f"Dataset created with {len(dataset)} examples. First example:\n{dataset[0]['text']}\n")

# Splitting into train and test is good practice, even for small datasets
# For this small dataset, we'll use all for training for simplicity,
# but in a real project, always split your data!
train_dataset = dataset
# eval_dataset = dataset.select(range(2)) # Example for a tiny eval set

Explanation:

  • We define a list of dictionaries, each representing an instruction and its desired response.
  • The format_instruction function converts these into a single string following a specific template (### Instruction:\n...\n\n### Response:\n...). This template is crucial because the model learns to generate text in this format. When you later prompt the fine-tuned model, you’ll use the ### Instruction:\nYOUR_PROMPT\n\n### Response: part to tell it what to do.
  • Dataset.from_list() creates a Hugging Face Dataset object.
  • .map() applies our formatting function to each example.

Step 2: Load Base Model and Tokenizer with Quantization

We’ll use a small, performant open-source model like Mistral-7B-Instruct-v0.2. To fit it into GPU memory, we’ll use 4-bit quantization via BitsAndBytesConfig.

Continue adding to finetune_llm.py:

# --- 2. Load Base Model and Tokenizer ---
print("Step 2: Loading base model and tokenizer...")

model_name = "mistralai/Mistral-7B-Instruct-v0.2" # Using Mistral-7B-Instruct-v0.2
# model_name = "meta-llama/Llama-2-7b-hf" # Another good option, requires Hugging Face login

# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4", # Normalized float 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16, # Computation in bfloat16 for speed and stability
    bnb_4bit_use_double_quant=False, # Set to True for double quantization and an even smaller memory footprint
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set padding token to EOS token
tokenizer.padding_side = "right" # Important for causal LMs

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto" # Automatically maps model layers to available devices
)
model.config.use_cache = False # Disable cache for training to save memory
model.config.pretraining_tp = 1 # Disable the tensor-parallel pretraining heuristic (a Llama-family config setting; harmless here)

# Prepare model for k-bit training (important for LoRA)
# This casts the layer norms to float32 and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

print("Base model and tokenizer loaded.")

Explanation:

  • model_name: Specifies the pre-trained model from Hugging Face Hub.
  • BitsAndBytesConfig: This is the magic for memory efficiency.
    • load_in_4bit=True: Loads the model weights in 4-bit precision.
    • bnb_4bit_quant_type="nf4": Uses the “NormalFloat 4” quantization scheme, which is well suited to the roughly normally distributed weights of neural networks.
    • bnb_4bit_compute_dtype=torch.bfloat16: Specifies that computations (like activations and gradients) should happen in bfloat16 (Brain Floating Point 16-bit). This offers a good balance of precision and speed on modern GPUs.
  • device_map="auto": An argument to from_pretrained (not to BitsAndBytesConfig); Hugging Face accelerate figures out how to best distribute the model across your available devices.
  • tokenizer.pad_token = tokenizer.eos_token: Sets the padding token. This is crucial for batching sequences of different lengths.
  • tokenizer.padding_side = "right": For causal language models (which predict the next token), padding on the right is standard.
  • model.config.use_cache = False: Disables attention caching during training, which reduces memory usage.
  • prepare_model_for_kbit_training: A utility function from peft that performs necessary modifications (e.g., casting layer norms to float32) to make the quantized model trainable with LoRA.

Step 3: Configure LoRA

Now, we’ll tell peft how to set up the LoRA adapters.

Continue adding to finetune_llm.py:

# --- 3. Configure LoRA ---
print("Step 3: Configuring LoRA...")

lora_config = LoraConfig(
    r=16, # LoRA attention dimension (rank). Common values: 8, 16, 32, 64
    lora_alpha=32, # Alpha parameter for LoRA scaling. Usually twice `r`.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA to
    bias="none", # We don't fine-tune biases with LoRA typically
    lora_dropout=0.05, # Dropout probability for LoRA layers
    task_type="CAUSAL_LM", # Specify the task type
)

# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # See how many parameters are now trainable

print("LoRA configured and applied to the model.")

Explanation:

  • LoraConfig: Defines the specifics of our LoRA setup.
    • r (rank): This is the most important hyperparameter. It determines the dimensionality of the low-rank matrices. Higher r means more trainable parameters, potentially better performance, but also more memory and computation. Common values are 8, 16, 32, 64. We start with 16.
    • lora_alpha: A scaling factor for the LoRA weights. Often set to 2 * r.
    • target_modules: This is a list of the names of the linear layers in the transformer model where LoRA adapters will be injected. For Mistral, q_proj, k_proj, v_proj, o_proj (from attention), and gate_proj, up_proj, down_proj (from MLP) are common choices. You can inspect model.named_modules() to find layer names.
    • bias="none": We typically don’t fine-tune bias terms with LoRA.
    • lora_dropout: Applies dropout to the LoRA layers to prevent overfitting.
    • task_type="CAUSAL_LM": Tells PEFT that we are fine-tuning a causal language model.
  • get_peft_model(model, lora_config): This function from peft intelligently wraps your base model with the LoRA adapters, making it ready for training.
  • model.print_trainable_parameters(): This useful function shows you how many parameters are now trainable (only the LoRA adapters) versus the total model parameters. You’ll see a dramatic reduction!
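To see the filtering idea behind target_modules without loading a 7B model, the sketch below runs it on a hard-coded sample of Mistral-style layer names (hypothetical stand-ins; in practice you would iterate model.named_modules() on the real model):

```python
# Picking LoRA target modules from a model's layer names.
# sample_names mimics what model.named_modules() yields for a Mistral-style
# transformer; here it is hard-coded so the sketch runs stand-alone.
sample_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.input_layernorm",
]

# Keep only the projection layers; strip the path prefix, since PEFT
# matches target_modules against the final name component.
targets = sorted({name.rsplit(".", 1)[-1]
                  for name in sample_names if name.endswith("_proj")})
print(targets)  # ['gate_proj', 'k_proj', 'q_proj']
```

On the real model the same loop surfaces all seven projection names used in the LoraConfig above, repeated once per transformer layer.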

Step 4: Set up Training Arguments

We need to define how the training process itself will run (learning rate, epochs, logging, etc.).

Continue adding to finetune_llm.py:

# --- 4. Set up Training Arguments ---
print("Step 4: Setting up training arguments...")

output_dir = "./results" # Directory to save checkpoints and logs
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3, # Number of training epochs (iterations over the dataset)
    per_device_train_batch_size=4, # Batch size per GPU (adjust based on VRAM)
    gradient_accumulation_steps=2, # Accumulate gradients over multiple steps to simulate larger batch size
    optim="paged_adamw_8bit", # Optimizer: paged AdamW for memory efficiency
    learning_rate=2e-4, # Learning rate for fine-tuning
    logging_steps=10, # Log training metrics every N steps
    save_strategy="epoch", # Save model checkpoint after each epoch
    report_to="none", # Or "tensorboard", "wandb" for tracking
    fp16=False, # Disabled in favor of bf16 below; set fp16=True only if your GPU lacks bf16 support
    bf16=True, # bfloat16 training; supported on NVIDIA Ampere GPUs and newer
    max_grad_norm=0.3, # Clip gradients to prevent exploding gradients
    warmup_ratio=0.03, # Linear warmup for learning rate scheduler
    lr_scheduler_type="cosine", # Learning rate scheduler type
    disable_tqdm=False, # Enable progress bar
)

print("Training arguments configured.")

Explanation:

  • TrainingArguments: This class from transformers bundles all training-related parameters.
    • output_dir: Where checkpoints and logs will be saved.
    • num_train_epochs: For a small dataset, 3-5 epochs are usually sufficient. More can lead to overfitting.
    • per_device_train_batch_size: How many examples are processed per GPU at once. Adjust this down if you hit OOM errors.
    • gradient_accumulation_steps: If your GPU can’t fit a large batch_size, you can accumulate gradients over several smaller batches. batch_size=4, gradient_accumulation_steps=2 effectively simulates a batch size of 8.
    • optim="paged_adamw_8bit": A memory-efficient AdamW optimizer from bitsandbytes that pages optimizer states to CPU when not needed.
    • learning_rate: A critical hyperparameter. For fine-tuning, it’s typically smaller than pre-training rates (e.g., 1e-5 to 5e-4).
    • logging_steps, save_strategy: Control how often metrics are logged and models are saved.
    • report_to="none": You can integrate with tools like Weights & Biases (wandb) or TensorBoard for better experiment tracking.
    • fp16=False, bf16=True: For modern GPUs (NVIDIA Ampere and newer), bf16 is generally preferred over fp16 for deep learning due to its wider dynamic range. Set fp16=True if your GPU doesn’t support bf16.
    • max_grad_norm: Clips gradients to prevent them from becoming too large, which can destabilize training.
    • warmup_ratio, lr_scheduler_type: Control the learning rate schedule.
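It helps to see what these numbers actually work out to for this chapter's run. The sketch below assumes a single GPU and the 10-example dataset from Step 1:

```python
# What the TrainingArguments above imply for this chapter's tiny dataset.
examples, epochs = 10, 3
per_device_bs, grad_accum, warmup_ratio = 4, 2, 0.03

effective_bs = per_device_bs * grad_accum       # examples per optimizer step
steps_per_epoch = -(-examples // effective_bs)  # ceiling division: ceil(10 / 8)
total_steps = steps_per_epoch * epochs          # optimizer steps over the whole run
warmup_steps = int(total_steps * warmup_ratio)  # rounds down to 0 for this run

print(effective_bs, total_steps, warmup_steps)  # 8 6 0
```

With only 6 optimizer steps, warmup_ratio=0.03 rounds down to zero warmup steps; the warmup and cosine schedule only become meaningful on realistically sized datasets.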

Step 5: Initialize SFTTrainer

The SFTTrainer from trl simplifies the fine-tuning process for instruction-tuned models.

Continue adding to finetune_llm.py:

# --- 5. Initialize SFTTrainer ---
print("Step 5: Initializing SFTTrainer...")

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    # eval_dataset=eval_dataset, # Uncomment if you have an eval_dataset
    peft_config=lora_config, # Pass the LoRA configuration
    dataset_text_field="text", # The column in our dataset containing the formatted text
    max_seq_length=512, # Maximum sequence length for the model. Adjust based on your data and GPU memory.
    tokenizer=tokenizer,
    args=training_arguments,
    packing=False, # Set to True for more efficient GPU usage by packing multiple short examples into one sequence
)

print("SFTTrainer initialized.")

Explanation:

  • SFTTrainer: Takes our model, dataset, LoRA config, tokenizer, and training arguments.
  • dataset_text_field="text": Crucially tells the trainer which column in our Dataset contains the text we want to train on.
  • max_seq_length: The maximum length of sequences fed to the model. Longer sequences require more memory. You might need to reduce this if you run into OOM errors.
  • packing=False: If True, SFTTrainer tries to concatenate multiple short examples into a single, longer sequence to make better use of GPU memory. For our tiny dataset, False is fine, but for larger datasets with many short texts, True can be a significant optimization.

Step 6: Train the Model

The moment of truth!

Continue adding to finetune_llm.py:

# --- 6. Train the Model ---
print("Step 6: Starting model training...")

trainer.train()

print("Model training complete!")

# --- 7. Save the Fine-tuned Adapter ---
print("Step 7: Saving the fine-tuned LoRA adapter...")
trainer.save_model(os.path.join(output_dir, "final_checkpoint"))
print(f"LoRA adapter saved to {os.path.join(output_dir, 'final_checkpoint')}")

Explanation:

  • trainer.train(): Kicks off the training loop. You’ll see a progress bar and logged metrics (loss, learning rate, etc.).
  • trainer.save_model(): Saves the LoRA adapter weights to the specified directory. It does not save the entire base model, only the small, trained LoRA matrices. This is a huge advantage of PEFT – small checkpoints!

Step 8: Inference with the Fine-tuned Model

Now, let’s see how our fine-tuned model performs! We’ll load the base model, then load our LoRA adapter weights on top of it.

Continue adding to finetune_llm.py:

# --- 8. Inference with the Fine-tuned Model ---
print("\nStep 8: Performing inference with the fine-tuned model...")

# Load the base model again (without quantization for simplicity, or with if you prefer)
# For production, you'd likely load the base model with the same quantization
# and then add the adapter.
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
    model_name,
    return_dict=True,
    torch_dtype=torch.bfloat16, # Use bfloat16 for inference if possible
    device_map="auto",
)

# Load the LoRA adapter weights
from peft import PeftModel

model_path = os.path.join(output_dir, "final_checkpoint")
fine_tuned_model = PeftModel.from_pretrained(base_model_for_inference, model_path)

# You can optionally merge the LoRA weights into the base model for faster inference
# This will create a new full model with the combined weights
# fine_tuned_model = fine_tuned_model.merge_and_unload()

# Get the tokenizer
inference_tokenizer = AutoTokenizer.from_pretrained(model_name)
inference_tokenizer.pad_token = inference_tokenizer.eos_token
inference_tokenizer.padding_side = "right"

# Test the fine-tuned model
def generate_response(instruction, model, tokenizer):
    # Format the instruction exactly as during training
    prompt = f"### Instruction:\n{instruction}\n\n### Response:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Move inputs to the same device as the model

    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100, # Max length of the generated response
            do_sample=True, # Enable sampling for more creative responses
            top_k=50, # Consider only top 50 most probable tokens
            top_p=0.95, # Nucleus sampling
            temperature=0.7, # Controls randomness: lower means more deterministic
            eos_token_id=tokenizer.eos_token_id, # Stop generation at EOS token
        )

    # Decode and print the response
    response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
    return response.strip()

# Test cases
test_instructions = [
    "What is the capital of Japan?",
    "Tell me a short story about a brave knight.",
    "What is the chemical symbol for water?",
    "What is the best way to learn programming?",
]

print("\n--- Testing Fine-Tuned Model ---")
for inst in test_instructions:
    print(f"\nInstruction: {inst}")
    response = generate_response(inst, fine_tuned_model, inference_tokenizer)
    print(f"Response: {response}")

print("\n--- Testing Original Model (Optional, for comparison) ---")
# To compare, you'd load the original model and tokenizer without PEFT
# and run the same generate_response function.
# For simplicity, we'll skip loading the original model again here,
# but you can try it to see the difference!

Explanation:

  • AutoModelForCausalLM.from_pretrained(...): We load the original base model.
  • PeftModel.from_pretrained(base_model_for_inference, model_path): This is where the magic happens! We load our tiny LoRA adapter weights and peft intelligently applies them to the base model, creating a PeftModel ready for inference.
  • merge_and_unload(): An optional step. It merges the LoRA weights directly into the base model’s original weight matrices. The result is a standard model that no longer needs the peft wrapper, offering slightly faster inference and simpler deployment; the trade-off is that you must then save and ship full-size model weights instead of a tiny adapter.
  • generate_response function:
    • It formats the prompt using the exact same template used during training. This is absolutely critical for the model to understand the instruction.
    • It uses model.generate() with various generation parameters (max_new_tokens, do_sample, temperature, top_k, top_p) to control the quality and creativity of the generated text.
    • It decodes the generated tokens and extracts only the new response.

Mini-Challenge

Challenge: Experiment with different LoRA configurations.

  1. Change r and lora_alpha: In LoraConfig, try setting r=8 and lora_alpha=16, or r=32 and lora_alpha=64.
  2. Modify target_modules: Try fine-tuning only the attention projection layers (["q_proj", "k_proj", "v_proj", "o_proj"]) or adding all linear layers you can find using model.named_modules().
  3. Adjust num_train_epochs: Try training for just 1 epoch or for 5 epochs.

Hint: Remember that r directly impacts the number of trainable parameters. A smaller r means fewer parameters, faster training, and less memory, but might limit the model’s ability to adapt. A larger r might capture more nuances but requires more resources.

What to observe/learn:

  • How does changing r affect the “Trainable params” reported by model.print_trainable_parameters()?
  • Does increasing r lead to better responses for your specific task, or does it overfit the small dataset?
  • How does the training time change with different r values?
  • Do you notice any difference in the quality of the generated responses when you change the number of epochs or target modules?

Common Pitfalls & Troubleshooting

  1. Out of Memory (OOM) Errors:

    • Symptom: Your script crashes with a message like CUDA out of memory.
    • Solution:
      • Reduce per_device_train_batch_size: This is the first thing to try. Make it as small as 1 if necessary.
      • Increase gradient_accumulation_steps: If you reduce batch_size, compensate by increasing gradient_accumulation_steps to maintain a similar effective batch size.
      • Reduce max_seq_length: Shorter sequences consume less memory.
      • Use bitsandbytes quantization: Ensure load_in_4bit=True is correctly configured and bitsandbytes is installed.
      • Reduce r in LoraConfig: Fewer LoRA parameters mean less memory for their gradients.
      • Close other GPU-intensive applications.
      • Use a smaller base model: If Mistral 7B is too large, consider models like TinyLlama/TinyLlama-1.1B-Chat-v1.0 or microsoft/phi-2.
  2. Poor Model Performance / Model Hallucinating:

    • Symptom: The fine-tuned model doesn’t follow instructions well, generates nonsensical responses, or doesn’t improve over the base model.
    • Solution:
      • Data Quality and Quantity: Is your dataset truly high-quality and representative of the task? For real-world tasks, 10 examples are far too few. Aim for hundreds or thousands of diverse, well-formatted examples.
      • Instruction Formatting: Double-check that your training data format (### Instruction:\n...\n\n### Response:) is exactly matched during inference. This is a common mistake.
      • Hyperparameter Tuning: Experiment with learning_rate, num_train_epochs, lora_alpha, r, and lora_dropout. These are critical.
      • target_modules in LoRA: Ensure you’re fine-tuning the most relevant layers. For Mistral, the provided list is a good starting point.
      • max_new_tokens in generate(): If responses are too short, increase this.
      • Generation Parameters: Adjust temperature, top_k, top_p. Higher temperature means more creative but potentially less coherent output.
  3. Installation Issues:

    • Symptom: pip install errors, ModuleNotFoundError.
    • Solution:
      • Virtual Environment: Always use a virtual environment to avoid dependency conflicts.
      • PyTorch CUDA: Ensure you’ve installed the correct PyTorch version for your CUDA toolkit. Check nvcc --version for your CUDA version and use the pytorch.org instructions.
      • bitsandbytes: This library can sometimes be tricky. Ensure your CUDA drivers are up to date. If issues persist, try installing bitsandbytes from source or using a specific pre-compiled wheel for your CUDA version if available.
      • Version Compatibility: While ~= helps, sometimes specific minor versions might have issues. Check GitHub issues for the libraries if you encounter persistent problems.

Summary

Congratulations! You’ve successfully navigated the complex world of LLM fine-tuning, applied Parameter-Efficient Fine-Tuning (PEFT) with LoRA, and specialized a powerful pre-trained model for a custom task.

Here are the key takeaways from this chapter:

  • Fine-tuning adapts pre-trained LLMs to specific tasks, leveraging their vast general knowledge.
  • Full fine-tuning is often too resource-intensive due to the immense size of LLMs.
  • Parameter-Efficient Fine-Tuning (PEFT) techniques, like LoRA, dramatically reduce computational requirements by training only a small fraction of parameters.
  • LoRA injects low-rank matrices into attention layers, allowing efficient adaptation.
  • The Hugging Face ecosystem (transformers, peft, trl, datasets, bitsandbytes, accelerate) provides the essential tools for this process.
  • 4-bit quantization with bitsandbytes is crucial for fitting large models on consumer GPUs.
  • Data quality and format (especially instruction-response templates) are critical for effective fine-tuning.
  • Hyperparameter tuning (e.g., r, lora_alpha, learning_rate, epochs) significantly impacts performance.
  • Inference requires loading the base model and then applying the trained LoRA adapter.

You now have a foundational understanding and hands-on experience with one of the most important techniques in modern AI development. This skill is highly sought after and opens doors to building truly custom, powerful AI applications.

What’s next? In the upcoming chapters, we’ll continue to build on this expertise, exploring more advanced PEFT methods, evaluating models more rigorously, and delving into how to deploy these fine-tuned models for real-world use cases. You’re well on your way to becoming a proficient AI/ML engineer!

