Introduction
Welcome to an exciting hands-on chapter where we’ll dive deep into the practical art of fine-tuning Large Language Models (LLMs)! You’ve learned about the power of these models, their architectures, and how they process language. Now, it’s time to make them truly yours by adapting them to perform a specific task that their general pre-training might not have fully covered.
In this chapter, you will learn how to take a pre-trained LLM and, with relatively small computational resources, specialize it for a new, targeted purpose. We’ll focus on Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly LoRA (Low-Rank Adaptation), which has revolutionized how we can adapt massive models without needing supercomputers. By the end of this project, you’ll have fine-tuned an LLM and tested its specialized capabilities, gaining invaluable experience in a crucial skill for modern AI engineers.
This project builds upon your understanding of deep learning, neural network training workflows, and model evaluation from previous chapters. Familiarity with Python, PyTorch, and the basics of the Hugging Face transformers library will be beneficial, but we’ll guide you through every step. Let’s get started and make an LLM smarter for your needs!
Core Concepts
Before we jump into the code, let’s establish a solid understanding of the core concepts that make LLM fine-tuning both possible and efficient.
What is Fine-Tuning?
Imagine you’ve taught a brilliant student (our pre-trained LLM) everything about the world – history, science, literature, art. Now, you need them to become an expert in a very niche field, like “identifying positive sentiment in customer reviews for a specific product line.” While the student has general knowledge, they need specialized training to excel at this particular task.
Fine-tuning is precisely that specialized training. We take a model that has already learned a vast amount of general knowledge from a massive dataset (its pre-training) and then train it further on a smaller, task-specific dataset. This process allows the model to adapt its existing knowledge to the nuances of the new task, often achieving impressive performance with much less data and computation than training from scratch.
Why not just train a small model from scratch for the specific task? Because LLMs, even after fine-tuning, retain much of their general understanding of language, grammar, and reasoning, which provides a powerful foundation that a small, task-specific model could never build on its own.
The Challenge of Full Fine-Tuning
LLMs are massive. Llama 2 70B has 70 billion parameters, and frontier models such as GPT-4 are reported to be far larger still. If we were to fine-tune all of these parameters on a new dataset, it would require:
- Enormous GPU Memory: Loading the entire model and its optimizers can easily consume hundreds of gigabytes of VRAM.
- Significant Computational Power: Updating billions of parameters for many iterations is computationally expensive and slow.
- Risk of Catastrophic Forgetting: Over-training on a small, specific dataset can sometimes make the model “forget” its general knowledge, degrading its performance on broader tasks.
These challenges make full fine-tuning impractical for most individual developers or smaller teams. This is where Parameter-Efficient Fine-Tuning (PEFT) comes to the rescue!
Parameter-Efficient Fine-Tuning (PEFT)
PEFT techniques are designed to address the challenges of full fine-tuning by only updating a small fraction of the model’s parameters, or by introducing a few new, small parameters, while keeping the majority of the pre-trained weights frozen. This drastically reduces computational cost, memory footprint, and the risk of catastrophic forgetting.
Think of it like this: instead of rewriting the entire textbook for our brilliant student, we just add a few specialized notes or a small supplementary chapter that focuses on the niche topic. The student still uses their core knowledge but now has specific guidance for the new task.
There are several PEFT methods, but one has emerged as a clear leader due to its simplicity, effectiveness, and widespread adoption: LoRA.
LoRA (Low-Rank Adaptation)
LoRA, or Low-Rank Adaptation, is a technique that adds small, trainable matrices alongside the model's linear layers (most commonly the attention projections). Instead of directly fine-tuning the large weight matrices (W) of the pre-trained model, LoRA introduces two much smaller matrices, A and B, whose product forms a low-rank update to W.
Conceptually, LoRA works like this:
- `W` is the large, pre-trained weight matrix (e.g., for query, key, or value projections in an attention layer). It remains frozen.
- LoRA introduces two much smaller matrices, `A` and `B`. The input features are multiplied by `A`, then the result by `B`.
- The output of this `A @ B` path (`ΔW`) is added to the output of the original `W` matrix.
- Crucially, only `A` and `B` are trained. Since `A` and `B` have a much smaller rank `r` than the original matrix dimensions, the total number of trainable parameters is drastically reduced.
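The parameter savings are easy to verify with a toy example. The sketch below uses hypothetical dimensions (not Mistral's real ones) to show the low-rank forward pass and how few parameters LoRA actually trains:

```python
import torch

# Toy LoRA forward pass with hypothetical dimensions (not Mistral's real ones).
d, r = 4096, 16                       # layer width and LoRA rank

W = torch.randn(d, d)                 # frozen pre-trained weight: never updated
A = torch.randn(r, d) * 0.01          # trainable down-projection
B = torch.zeros(d, r)                 # trainable up-projection, initialized to
                                      # zero so the update starts as a no-op

x = torch.randn(1, d)
# Output = original path + low-rank correction (delta_W = B @ A)
y = x @ W.T + (x @ A.T) @ B.T

full_params = W.numel()               # what a full fine-tune would update
lora_params = A.numel() + B.numel()   # what LoRA actually trains
print(f"full: {full_params:,}  lora: {lora_params:,} "
      f"({lora_params / full_params:.2%} of the layer)")
```

Note that `B` starts at zero, so at the beginning of training the LoRA path contributes nothing and the model behaves exactly like the frozen base model.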
Why LoRA is Powerful:
- Memory Efficiency: Freezing most weights means less memory for gradients.
- Computational Efficiency: Fewer parameters to update means faster training.
- Performance: Often achieves performance comparable to full fine-tuning.
- Modular Adapters: You can train multiple LoRA adapters for different tasks and swap them in and out, or even combine them, without modifying the base model. This is incredibly flexible!
Supervised Fine-Tuning (SFT) Datasets
For fine-tuning, we typically use a technique called Supervised Fine-Tuning (SFT). This involves providing the model with examples of inputs and their desired outputs. For LLMs, this often takes the form of instruction-response pairs, like:
"Instruction: Summarize the following text: [TEXT]\nResponse: [SUMMARY]"
or
"Instruction: What is the capital of France?\nResponse: Paris"
The quality and format of your SFT dataset are paramount. A good dataset is:
- Relevant: Directly pertains to the task you want the LLM to perform.
- Diverse: Covers a wide range of examples within your task domain.
- High-Quality: Free from errors, inconsistencies, and biases.
- Formatted Correctly: Structured in a way that the model can easily learn from (e.g., consistent instruction/response templates).
Modern Tooling: Hugging Face Ecosystem (2026-01-17)
The Hugging Face ecosystem continues to be the de facto standard for working with LLMs. We’ll be using several key libraries:
- `transformers` (version ~=4.37.0): Provides pre-trained models, tokenizers, and a unified API for various architectures.
- `peft` (version ~=0.8.0): The Parameter-Efficient Fine-Tuning library, offering implementations of LoRA and other PEFT methods.
- `trl` (version ~=0.7.10): The Transformer Reinforcement Learning library, which includes `SFTTrainer` for easy supervised fine-tuning.
- `datasets` (version ~=2.16.1): For efficient loading, processing, and managing datasets.
- `bitsandbytes` (version ~=0.42.0): Enables efficient 4-bit and 8-bit quantization, allowing you to load and fine-tune massive models on consumer GPUs.
- `accelerate` (version ~=0.26.1): Simplifies distributed training and mixed-precision training.
These versions are stable and widely used as of early 2026. Always refer to the Hugging Face documentation for the absolute latest updates.
Step-by-Step Implementation: Fine-Tuning an LLM for Instruction Following
Our goal for this project is to fine-tune a small LLM (e.g., Mistral 7B) to follow instructions more precisely on a custom dataset. We’ll simulate a simple instruction-following task.
Environment Setup
First, let’s set up your environment. You’ll need Python 3.10 or newer. A GPU (NVIDIA preferred) with at least 12GB of VRAM is highly recommended for Mistral 7B with 4-bit quantization, though 8GB might work for smaller models or more aggressive quantization.
Open your terminal or command prompt and run the following commands:
# Create a new virtual environment (highly recommended!)
python -m venv llm_finetune_env
source llm_finetune_env/bin/activate # On Windows: .\llm_finetune_env\Scripts\activate
# Install PyTorch (ensure you get the CUDA version if you have an NVIDIA GPU)
# Check https://pytorch.org/get-started/locally/ for the exact command for your CUDA version
# Example for CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install Hugging Face libraries and bitsandbytes
pip install transformers~=4.37.0 peft~=0.8.0 trl~=0.7.10 datasets~=2.16.1 bitsandbytes~=0.42.0 accelerate~=0.26.1
Note: The `~=` operator in `pip install` means "compatible release." It ensures you get a version close to the specified one, picking up bug-fix updates while avoiding breaking changes.
Step 1: Data Preparation
We’ll create a synthetic dataset for demonstration purposes. In a real-world scenario, you would curate this from actual data. Our dataset will consist of simple instruction-response pairs.
Create a new Python file, e.g., finetune_llm.py.
# finetune_llm.py
import torch
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
import os
# --- 1. Data Preparation ---
print("Step 1: Preparing dataset...")
# Define our synthetic dataset
# Each entry is a dictionary with 'instruction' and 'response'
# We'll format this into a single 'text' field for the SFTTrainer
data = [
{"instruction": "What is the capital of Canada?", "response": "The capital of Canada is Ottawa."},
{"instruction": "Name two types of big cats.", "response": "Two types of big cats are lions and tigers."},
{"instruction": "How do you say 'hello' in Spanish?", "response": "You say 'hola' in Spanish."},
{"instruction": "Explain the concept of photosynthesis briefly.", "response": "Photosynthesis is the process by which green plants convert light energy into chemical energy, producing oxygen as a byproduct."},
{"instruction": "Who painted the Mona Lisa?", "response": "The Mona Lisa was painted by Leonardo da Vinci."},
{"instruction": "What is 2 + 2?", "response": "2 + 2 equals 4."},
{"instruction": "Tell me a fun fact about space.", "response": "A full NASA space suit costs about $12 million."},
{"instruction": "What is the largest ocean on Earth?", "response": "The Pacific Ocean is the largest ocean on Earth."},
{"instruction": "Define 'algorithm'.", "response": "An algorithm is a set of step-by-step instructions or rules designed to solve a problem or perform a task."},
{"instruction": "What is the main ingredient in guacamole?", "response": "The main ingredient in guacamole is avocado."}
]
# Convert to Hugging Face Dataset format
# SFTTrainer expects a 'text' column, so we'll create that
def format_instruction(example):
    return {"text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"}
dataset = Dataset.from_list(data)
dataset = dataset.map(format_instruction)
print(f"Dataset created with {len(dataset)} examples. First example:\n{dataset[0]['text']}\n")
# Splitting into train and test is good practice, even for small datasets
# For this small dataset, we'll use all for training for simplicity,
# but in a real project, always split your data!
train_dataset = dataset
# eval_dataset = dataset.select(range(2)) # Example for a tiny eval set
Explanation:
- We define a list of dictionaries, each representing an instruction and its desired response.
- The `format_instruction` function converts these into a single string following a specific template (`### Instruction:\n...\n\n### Response:\n...`). This template is crucial because the model learns to generate text in this format. When you later prompt the fine-tuned model, you'll use the `### Instruction:\nYOUR_PROMPT\n\n### Response:` part to tell it what to do.
- `Dataset.from_list()` creates a Hugging Face `Dataset` object.
- `.map()` applies our formatting function to each example.
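As the code comments note, real projects should always split their data. A minimal stdlib sketch of an 80/20 split is below (the `datasets` library also provides `dataset.train_test_split(test_size=0.2)` for the same purpose); the data here is a hypothetical stand-in for the pairs defined above:

```python
import random

# Hypothetical stand-in for the instruction/response pairs defined above.
data = [{"instruction": f"Q{i}", "response": f"A{i}"} for i in range(10)]

random.seed(42)                 # fixed seed so the split is reproducible
shuffled = data[:]              # copy so the original order is untouched
random.shuffle(shuffled)

split = int(0.8 * len(shuffled))             # 80% train / 20% eval
train_data, eval_data = shuffled[:split], shuffled[split:]
print(len(train_data), len(eval_data))       # 8 2
```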
Step 2: Load Base Model and Tokenizer with Quantization
We’ll use a small, performant open-source model like Mistral-7B-Instruct-v0.2. To fit it into GPU memory, we’ll use 4-bit quantization via BitsAndBytesConfig.
Continue adding to finetune_llm.py:
# --- 2. Load Base Model and Tokenizer ---
print("Step 2: Loading base model and tokenizer...")
model_name = "mistralai/Mistral-7B-Instruct-v0.2" # Using Mistral-7B-Instruct-v0.2
# model_name = "meta-llama/Llama-2-7b-hf" # Another good option, requires Hugging Face login
# Configure 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4", # Normalized float 4-bit
bnb_4bit_compute_dtype=torch.bfloat16, # Computation in bfloat16 for speed and stability
bnb_4bit_use_double_quant=False, # Optional: double quantization for even smaller memory footprint
)
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token # Set padding token to EOS token
tokenizer.padding_side = "right" # Important for causal LMs
# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto" # Automatically maps model layers to available devices
)
model.config.use_cache = False # Disable cache for training to save memory
model.config.pretraining_tp = 1 # Llama-family config setting; forces the standard (non-tensor-parallel) linear path
# Prepare model for k-bit training (important for LoRA)
# This casts the layer norms to float32 and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)
print("Base model and tokenizer loaded.")
Explanation:
- `model_name`: Specifies the pre-trained model from Hugging Face Hub.
- `BitsAndBytesConfig`: This is the magic for memory efficiency.
  - `load_in_4bit=True`: Loads the model weights in 4-bit precision.
  - `bnb_4bit_quant_type="nf4"`: Uses the "NormalFloat 4" quantization scheme, which works well for neural network weights.
  - `bnb_4bit_compute_dtype=torch.bfloat16`: Specifies that computations (like activations and gradients) happen in `bfloat16` (Brain Floating Point 16-bit), a good balance of precision and speed on modern GPUs.
- `device_map="auto"`: Hugging Face `accelerate` figures out how best to distribute the model across your available devices.
- `tokenizer.pad_token = tokenizer.eos_token`: Sets the padding token, which is crucial for batching sequences of different lengths.
- `tokenizer.padding_side = "right"`: For causal language models (which predict the next token), padding on the right is standard.
- `model.config.use_cache = False`: Disables attention caching during training, which reduces memory usage.
- `prepare_model_for_kbit_training`: A utility function from `peft` that performs the modifications (e.g., casting layer norms to `float32`) needed to make the quantized model trainable with LoRA.
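To see why 4-bit loading matters, here is a rough back-of-envelope estimate of the VRAM needed just to hold the weights of a 7B-parameter model at different precisions (real usage is higher: activations, gradients, and optimizer state all add more):

```python
# Rough weights-only VRAM estimate for a 7B-parameter model.
# Real usage is higher: activations, gradients, and optimizer state add more.
params = 7_000_000_000

estimates = {name: params * bytes_per_param / 1024**3
             for name, bytes_per_param in
             [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]}

for name, gb in estimates.items():
    print(f"{name:>9}: ~{gb:.1f} GB for the weights alone")
```

At 4 bits per weight, the 7B model fits comfortably in a 12 GB consumer GPU with room left over for LoRA gradients and activations; at fp16 it would barely fit at all.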
Step 3: Configure LoRA
Now, we’ll tell peft how to set up the LoRA adapters.
Continue adding to finetune_llm.py:
# --- 3. Configure LoRA ---
print("Step 3: Configuring LoRA...")
lora_config = LoraConfig(
r=16, # LoRA attention dimension (rank). Common values: 8, 16, 32, 64
lora_alpha=32, # Alpha parameter for LoRA scaling. Usually twice `r`.
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"], # Modules to apply LoRA to
bias="none", # We don't fine-tune biases with LoRA typically
lora_dropout=0.05, # Dropout probability for LoRA layers
task_type="CAUSAL_LM", # Specify the task type
)
# Apply LoRA to the model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters() # See how many parameters are now trainable
print("LoRA configured and applied to the model.")
Explanation:
- `LoraConfig`: Defines the specifics of our LoRA setup.
  - `r` (rank): The most important hyperparameter. It determines the dimensionality of the low-rank matrices. Higher `r` means more trainable parameters and potentially better performance, but also more memory and computation. Common values are 8, 16, 32, and 64; we start with 16.
  - `lora_alpha`: A scaling factor for the LoRA weights. Often set to `2 * r`.
  - `target_modules`: The names of the linear layers where LoRA adapters will be injected. For Mistral, `q_proj`, `k_proj`, `v_proj`, `o_proj` (from attention) and `gate_proj`, `up_proj`, `down_proj` (from the MLP) are common choices. You can inspect `model.named_modules()` to find layer names.
  - `bias="none"`: We typically don't fine-tune bias terms with LoRA.
  - `lora_dropout`: Applies dropout to the LoRA layers to prevent overfitting.
  - `task_type="CAUSAL_LM"`: Tells PEFT that we are fine-tuning a causal language model.
- `get_peft_model(model, lora_config)`: This function from `peft` wraps your base model with the LoRA adapters, making it ready for training.
- `model.print_trainable_parameters()`: Shows how many parameters are now trainable (only the LoRA adapters) versus the total model parameters. You'll see a dramatic reduction!
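The `named_modules()` inspection mentioned above looks like the following. This is a toy stand-in for a transformer block, just to illustrate the pattern; with the real model you would iterate over the loaded Mistral instead:

```python
import torch.nn as nn

# Toy stand-in for a transformer block, used only to show the inspection
# pattern; replace `toy` with the loaded Mistral model in practice.
toy = nn.ModuleDict({
    "q_proj": nn.Linear(64, 64),
    "k_proj": nn.Linear(64, 64),
    "mlp": nn.Sequential(nn.Linear(64, 128), nn.Linear(128, 64)),
})

# Collect every nn.Linear by its dotted name; these names (or their last
# component, e.g. "q_proj") are what LoraConfig's target_modules refers to.
linear_names = [name for name, module in toy.named_modules()
                if isinstance(module, nn.Linear)]
print(linear_names)  # ['q_proj', 'k_proj', 'mlp.0', 'mlp.1']
```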
Step 4: Set up Training Arguments
We need to define how the training process itself will run (learning rate, epochs, logging, etc.).
Continue adding to finetune_llm.py:
# --- 4. Set up Training Arguments ---
print("Step 4: Setting up training arguments...")
output_dir = "./results" # Directory to save checkpoints and logs
training_arguments = TrainingArguments(
output_dir=output_dir,
num_train_epochs=3, # Number of training epochs (iterations over the dataset)
per_device_train_batch_size=4, # Batch size per GPU (adjust based on VRAM)
gradient_accumulation_steps=2, # Accumulate gradients over multiple steps to simulate larger batch size
optim="paged_adamw_8bit", # Optimizer: paged AdamW for memory efficiency
learning_rate=2e-4, # Learning rate for fine-tuning
logging_steps=10, # Log training metrics every N steps
save_strategy="epoch", # Save model checkpoint after each epoch
report_to="none", # Or "tensorboard", "wandb" for tracking
fp16=False, # Keep fp16 off; we enable bf16 below instead
bf16=True, # Enable bfloat16 training if supported (e.g., NVIDIA Ampere GPUs and newer); on older GPUs set bf16=False and fp16=True
max_grad_norm=0.3, # Clip gradients to prevent exploding gradients
warmup_ratio=0.03, # Linear warmup for learning rate scheduler
lr_scheduler_type="cosine", # Learning rate scheduler type
disable_tqdm=False, # Enable progress bar
)
print("Training arguments configured.")
Explanation:
- `TrainingArguments`: This class from `transformers` bundles all training-related parameters.
- `output_dir`: Where checkpoints and logs will be saved.
- `num_train_epochs`: For a small dataset, 3-5 epochs are usually sufficient; more can lead to overfitting.
- `per_device_train_batch_size`: How many examples are processed per GPU at once. Reduce this if you hit OOM errors.
- `gradient_accumulation_steps`: If your GPU can't fit a large batch size, you can accumulate gradients over several smaller batches. `batch_size=4, gradient_accumulation_steps=2` effectively simulates a batch size of 8.
- `optim="paged_adamw_8bit"`: A memory-efficient AdamW optimizer from `bitsandbytes` that pages optimizer states to CPU when not needed.
- `learning_rate`: A critical hyperparameter. For fine-tuning it's typically smaller than pre-training rates (e.g., `1e-5` to `5e-4`).
- `logging_steps`, `save_strategy`: Control how often metrics are logged and checkpoints are saved.
- `report_to="none"`: You can instead integrate with tools like Weights & Biases (`wandb`) or TensorBoard for better experiment tracking.
- `fp16=False, bf16=True`: For modern GPUs (NVIDIA Ampere and newer), `bf16` is generally preferred over `fp16` for deep learning due to its wider dynamic range. Set `fp16=True` instead if your GPU doesn't support `bf16`.
- `max_grad_norm`: Clips gradients to prevent them from becoming too large, which can destabilize training.
- `warmup_ratio`, `lr_scheduler_type`: Control the learning rate schedule.
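The batch-size arithmetic above can be checked in a couple of lines. The sketch below computes the step counts implied by our arguments and tiny dataset (the Trainer's exact count can differ slightly depending on dataloader rounding behavior):

```python
import math

# Steps implied by the training arguments above, for our tiny dataset.
per_device_train_batch_size = 4
gradient_accumulation_steps = 2
num_train_epochs = 3
num_examples = 10                     # our synthetic dataset size

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * num_train_epochs
print(effective_batch, steps_per_epoch, total_steps)  # 8 2 6
```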
Step 5: Initialize SFTTrainer
The SFTTrainer from trl simplifies the fine-tuning process for instruction-tuned models.
Continue adding to finetune_llm.py:
# --- 5. Initialize SFTTrainer ---
print("Step 5: Initializing SFTTrainer...")
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset,
# eval_dataset=eval_dataset, # Uncomment if you have an eval_dataset
peft_config=lora_config, # Pass the LoRA configuration
dataset_text_field="text", # The column in our dataset containing the formatted text
max_seq_length=512, # Maximum sequence length for the model. Adjust based on your data and GPU memory.
tokenizer=tokenizer,
args=training_arguments,
packing=False, # Set to True for more efficient GPU usage by packing multiple short examples into one sequence
)
print("SFTTrainer initialized.")
Explanation:
- `SFTTrainer`: Takes our model, dataset, LoRA config, tokenizer, and training arguments.
- `dataset_text_field="text"`: Crucially tells the trainer which column in our `Dataset` contains the text to train on.
- `max_seq_length`: The maximum length of sequences fed to the model. Longer sequences require more memory; you might need to reduce this if you run into OOM errors.
- `packing=False`: If `True`, `SFTTrainer` tries to concatenate multiple short examples into a single, longer sequence to make better use of GPU memory. For our tiny dataset `False` is fine, but for larger datasets with many short texts, `True` can be a significant optimization.
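To make the packing idea concrete, here is a toy sketch with made-up token IDs: short examples are flattened into one stream and cut into fixed-length blocks (real implementations typically also insert an EOS token between examples and handle the remainder more carefully):

```python
# Toy illustration of packing: short tokenized examples concatenated into
# fixed-length blocks. Token IDs are made up; real implementations typically
# separate examples with an EOS token.
examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10, 11, 12, 13, 14]]
block_size = 8

stream = [tok for ex in examples for tok in ex]      # one continuous stream
packed = [stream[i:i + block_size]                   # full blocks only
          for i in range(0, len(stream) - block_size + 1, block_size)]
print(packed)  # [[1, 2, 3, 4, 5, 6, 7, 8]]  (leftover tokens are dropped)
```

Without packing, each short example would be padded up to `max_seq_length`, wasting most of each sequence on padding tokens.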
Step 6: Train the Model
The moment of truth!
Continue adding to finetune_llm.py:
# --- 6. Train the Model ---
print("Step 6: Starting model training...")
trainer.train()
print("Model training complete!")
# --- 7. Save the Fine-tuned Adapter ---
print("Step 7: Saving the fine-tuned LoRA adapter...")
trainer.save_model(os.path.join(output_dir, "final_checkpoint"))
print(f"LoRA adapter saved to {os.path.join(output_dir, 'final_checkpoint')}")
Explanation:
- `trainer.train()`: Kicks off the training loop. You'll see a progress bar and logged metrics (loss, learning rate, etc.).
- `trainer.save_model()`: Saves the LoRA adapter weights to the specified directory. It does not save the entire base model, only the small, trained LoRA matrices. This is a huge advantage of PEFT: small checkpoints!
Step 8: Inference with the Fine-tuned Model
Now, let’s see how our fine-tuned model performs! We’ll load the base model, then load our LoRA adapter weights on top of it.
Continue adding to finetune_llm.py:
# --- 8. Inference with the Fine-tuned Model ---
print("\nStep 8: Performing inference with the fine-tuned model...")
# Load the base model again (without quantization for simplicity, or with if you prefer)
# For production, you'd likely load the base model with the same quantization
# and then add the adapter.
base_model_for_inference = AutoModelForCausalLM.from_pretrained(
model_name,
return_dict=True,
torch_dtype=torch.bfloat16, # Use bfloat16 for inference if possible
device_map="auto",
)
# Load the LoRA adapter weights
from peft import PeftModel
model_path = os.path.join(output_dir, "final_checkpoint")
fine_tuned_model = PeftModel.from_pretrained(base_model_for_inference, model_path)
# You can optionally merge the LoRA weights into the base model for faster inference
# This will create a new full model with the combined weights
# fine_tuned_model = fine_tuned_model.merge_and_unload()
# Get the tokenizer
inference_tokenizer = AutoTokenizer.from_pretrained(model_name)
inference_tokenizer.pad_token = inference_tokenizer.eos_token
inference_tokenizer.padding_side = "right"
# Test the fine-tuned model
def generate_response(instruction, model, tokenizer):
    # Format the instruction exactly as during training
    prompt = f"### Instruction:\n{instruction}\n\n### Response:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)  # Move inputs to the model's device
    # Generate output
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,  # Max length of the generated response
            do_sample=True,  # Enable sampling for more varied responses
            top_k=50,  # Consider only the top 50 most probable tokens
            top_p=0.95,  # Nucleus sampling
            temperature=0.7,  # Controls randomness: lower means more deterministic
            eos_token_id=tokenizer.eos_token_id,  # Stop generation at the EOS token
        )
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(outputs[0][len(inputs["input_ids"][0]):], skip_special_tokens=True)
    return response.strip()
# Test cases
test_instructions = [
"What is the capital of Japan?",
"Tell me a short story about a brave knight.",
"What is the chemical symbol for water?",
"What is the best way to learn programming?",
]
print("\n--- Testing Fine-Tuned Model ---")
for inst in test_instructions:
print(f"\nInstruction: {inst}")
response = generate_response(inst, fine_tuned_model, inference_tokenizer)
print(f"Response: {response}")
print("\n--- Testing Original Model (Optional, for comparison) ---")
# To compare, you'd load the original model and tokenizer without PEFT
# and run the same generate_response function.
# For simplicity, we'll skip loading the original model again here,
# but you can try it to see the difference!
Explanation:
- `AutoModelForCausalLM.from_pretrained(...)`: We load the original base model.
- `PeftModel.from_pretrained(base_model_for_inference, model_path)`: This is where the magic happens! We load our tiny LoRA adapter weights, and `peft` applies them to the base model, creating a `PeftModel` ready for inference.
- `merge_and_unload()`: An optional step that merges the LoRA weights directly into the base model's original weight matrices. The result is a full, modified model that no longer needs the `peft` wrapper, potentially offering slightly faster inference or easier deployment, at the cost of a larger artifact.
- The `generate_response` function:
  - It formats the prompt using the exact same template used during training. This is absolutely critical for the model to understand the instruction.
  - It uses `model.generate()` with various generation parameters (`max_new_tokens`, `do_sample`, `temperature`, `top_k`, `top_p`) to control the quality and creativity of the generated text.
  - It decodes the generated tokens and extracts only the new response.
Mini-Challenge
Challenge: Experiment with different LoRA configurations.
- Change `r` and `lora_alpha`: In `LoraConfig`, try setting `r=8` and `lora_alpha=16`, or `r=32` and `lora_alpha=64`.
- Modify `target_modules`: Try fine-tuning only the attention projection layers (`["q_proj", "k_proj", "v_proj", "o_proj"]`), or add all the linear layers you can find using `model.named_modules()`.
- Adjust `num_train_epochs`: Try training for just 1 epoch, or for 5 epochs.
Hint:
Remember that r directly impacts the number of trainable parameters. A smaller r means fewer parameters, faster training, and less memory, but might limit the model’s ability to adapt. A larger r might capture more nuances but requires more resources.
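To put the hint in numbers, the sketch below estimates the LoRA parameter count for a few values of `r`. The layer shapes are approximate Mistral-7B values used purely for illustration; trust `model.print_trainable_parameters()` for the real figure:

```python
# Approximate Mistral-7B layer shapes, for illustration only; the authoritative
# count comes from model.print_trainable_parameters().
target_shapes = {
    "q_proj": (4096, 4096), "k_proj": (4096, 1024), "v_proj": (4096, 1024),
    "o_proj": (4096, 4096), "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336), "down_proj": (14336, 4096),
}
num_layers = 32

def lora_params(r):
    # Each adapted d_in x d_out layer gains A (r x d_in) and B (d_out x r).
    per_block = sum(r * (d_in + d_out) for d_in, d_out in target_shapes.values())
    return per_block * num_layers

for r in (8, 16, 32):
    print(f"r={r:>2}: ~{lora_params(r) / 1e6:.1f}M trainable parameters")
```

Note that the count scales linearly in `r`: doubling the rank doubles the trainable parameters.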
What to observe/learn:
- How does changing `r` affect the "Trainable params" reported by `model.print_trainable_parameters()`?
- Does increasing `r` lead to better responses for your specific task, or does it overfit the small dataset?
- How does the training time change with different `r` values?
- Do you notice any difference in the quality of the generated responses when you change the number of epochs or target modules?
Common Pitfalls & Troubleshooting
Out of Memory (OOM) Errors:
- Symptom: Your script crashes with a message like `CUDA out of memory`.
- Solution:
  - Reduce `per_device_train_batch_size`: This is the first thing to try. Make it as small as 1 if necessary.
  - Increase `gradient_accumulation_steps`: If you reduce the batch size, compensate by increasing `gradient_accumulation_steps` to maintain a similar effective batch size.
  - Reduce `max_seq_length`: Shorter sequences consume less memory.
  - Use `bitsandbytes` quantization: Ensure `load_in_4bit=True` is correctly configured and `bitsandbytes` is installed.
  - Reduce `r` in `LoraConfig`: Fewer LoRA parameters mean less memory for their gradients.
  - Close other GPU-intensive applications.
  - Use a smaller base model: If Mistral 7B is too large, consider models like `TinyLlama/TinyLlama-1.1B-Chat-v1.0` or `microsoft/phi-2`.
Poor Model Performance / Model Hallucinating:
- Symptom: The fine-tuned model doesn't follow instructions well, generates nonsensical responses, or doesn't improve over the base model.
- Solution:
  - Data quality and quantity: Is your dataset truly high-quality and representative of the task? For real-world tasks, 10 examples are far too few; aim for hundreds or thousands of diverse, well-formatted examples.
  - Instruction formatting: Double-check that your training data format (`### Instruction:\n...\n\n### Response:`) is matched exactly during inference. This is a common mistake.
  - Hyperparameter tuning: Experiment with `learning_rate`, `num_train_epochs`, `lora_alpha`, `r`, and `lora_dropout`. These are critical.
  - `target_modules` in LoRA: Ensure you're fine-tuning the most relevant layers. For Mistral, the provided list is a good starting point.
  - `max_new_tokens` in `generate()`: If responses are too short, increase this.
  - Generation parameters: Adjust `temperature`, `top_k`, `top_p`. Higher `temperature` means more creative but potentially less coherent output.
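The effect of `temperature` is easy to see numerically: it rescales the logits before the softmax. The toy logits below are hypothetical, chosen just to show the trend:

```python
import math

# How temperature reshapes a toy next-token distribution (hypothetical logits).
logits = [2.0, 1.0, 0.2, -1.0]

def softmax_with_temperature(logits, temperature):
    scaled = [l / temperature for l in logits]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: p(top token) = {probs[0]:.2f}")
# Lower temperature concentrates probability on the top token (more
# deterministic); higher temperature flattens the distribution (more random).
```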
Installation Issues:
- Symptom: `pip install` errors, `ModuleNotFoundError`.
- Solution:
  - Virtual environment: Always use a virtual environment to avoid dependency conflicts.
  - PyTorch CUDA: Ensure you've installed the correct PyTorch build for your CUDA toolkit. Check `nvcc --version` for your CUDA version and follow the `pytorch.org` instructions.
  - `bitsandbytes`: This library can sometimes be tricky. Ensure your CUDA drivers are up to date. If issues persist, try installing `bitsandbytes` from source or using a pre-compiled wheel for your CUDA version if available.
  - Version compatibility: While `~=` helps, specific minor versions can still have issues. Check the libraries' GitHub issues if you encounter persistent problems.
Summary
Congratulations! You’ve successfully navigated the complex world of LLM fine-tuning, applied Parameter-Efficient Fine-Tuning (PEFT) with LoRA, and specialized a powerful pre-trained model for a custom task.
Here are the key takeaways from this chapter:
- Fine-tuning adapts pre-trained LLMs to specific tasks, leveraging their vast general knowledge.
- Full fine-tuning is often too resource-intensive due to the immense size of LLMs.
- Parameter-Efficient Fine-Tuning (PEFT) techniques, like LoRA, dramatically reduce computational requirements by training only a small fraction of parameters.
- LoRA injects low-rank matrices alongside the model's linear layers (typically the attention and MLP projections), allowing efficient adaptation.
- The Hugging Face ecosystem (`transformers`, `peft`, `trl`, `datasets`, `bitsandbytes`, `accelerate`) provides the essential tools for this process.
- 4-bit quantization with `bitsandbytes` is crucial for fitting large models on consumer GPUs.
- Data quality and format (especially instruction-response templates) are critical for effective fine-tuning.
- Hyperparameter tuning (e.g., `r`, `lora_alpha`, `learning_rate`, epochs) significantly impacts performance.
- Inference requires loading the base model and then applying the trained LoRA adapter.
You now have a foundational understanding and hands-on experience with one of the most important techniques in modern AI development. This skill is highly sought after and opens doors to building truly custom, powerful AI applications.
What’s next? In the upcoming chapters, we’ll continue to build on this expertise, exploring more advanced PEFT methods, evaluating models more rigorously, and delving into how to deploy these fine-tuned models for real-world use cases. You’re well on your way to becoming a proficient AI/ML engineer!
References
- Hugging Face `transformers` documentation
- Hugging Face `peft` documentation
- Hugging Face `trl` documentation
- `bitsandbytes` GitHub repository
- PyTorch Get Started guide
- Mistral 7B Instruct v0.2 model card