Welcome back, future LLM master! In Chapter 3, we successfully set up our Tunix environment and explored its foundational components. Now, it’s time to put that knowledge into action and perform our very first model alignment task: Supervised Fine-Tuning (SFT).

This chapter is your hands-on guide to taking a pre-trained Large Language Model (LLM) and teaching it a new, specific skill using a carefully curated dataset. We’ll walk through everything from preparing your data to configuring Tunix’s powerful Trainer and observing your model learn. By the end, you’ll have a practical understanding of SFT and the confidence to apply it to your own projects. Get ready to make some LLMs smarter!

Core Concepts: Understanding Supervised Fine-Tuning (SFT)

Before we dive into code, let’s solidify our understanding of what SFT is and why it’s such a crucial step in the LLM lifecycle.

What is Supervised Fine-Tuning?

Imagine you have a brilliant student who knows a lot about many subjects but isn’t specialized in any particular one. That’s your pre-trained LLM! It has learned the general patterns of language from vast amounts of text, making it good at predicting the next word in a sequence.

Now, you want this student to become an expert in, say, answering specific coding questions. You wouldn’t re-teach them everything from scratch. Instead, you’d provide them with many examples of coding questions and their correct answers, guiding them to focus their existing knowledge on this new task.

This process is precisely what Supervised Fine-Tuning (SFT) does for LLMs:

  1. Starts with a Pre-trained LLM: We leverage the immense general knowledge already encoded in a base model.
  2. Uses Labeled Data: We provide a dataset consisting of input-output pairs (e.g., (prompt, desired_response)). The “supervised” part comes from these explicit labels.
  3. Adapts to a Specific Task: The model’s weights are adjusted to minimize the difference between its predictions and the desired outputs in the SFT dataset. This “fine-tunes” its behavior towards the new task.

Why is SFT important? It’s often the first and most fundamental step in aligning an LLM. It allows us to:

  • Make a general-purpose model follow specific instructions.
  • Improve performance on domain-specific tasks (e.g., legal, medical, coding).
  • Change the model’s output style or format.
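The "minimize the difference" in step 3 above is, concretely, next-token cross-entropy. Here is a minimal sketch with toy logits standing in for a real model (all shapes and numbers are illustrative only):

```python
import jax
import jax.numpy as jnp

def next_token_loss(logits, labels):
    """Mean cross-entropy of predicting labels[t+1] from position t."""
    # Shift by one: position t's logits predict token t+1.
    shifted_logits = logits[:, :-1, :]
    shifted_labels = labels[:, 1:]
    log_probs = jax.nn.log_softmax(shifted_logits, axis=-1)
    # Pick out the log-probability assigned to each true next token.
    token_logp = jnp.take_along_axis(
        log_probs, shifted_labels[..., None], axis=-1
    ).squeeze(-1)
    return -token_logp.mean()

# Toy example: batch of 1, sequence length 4, vocabulary of 5 tokens.
logits = jnp.zeros((1, 4, 5))           # all-zero logits = uniform predictions
labels = jnp.array([[1, 2, 3, 4]])
print(next_token_loss(logits, labels))  # ≈ log(5) ≈ 1.609, a uniform model's loss
```

SFT nudges the model's weights so this quantity drops on your (prompt, completion) pairs; nothing more exotic is happening under the hood.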

The SFT Dataset: The Fuel for Fine-Tuning

The quality and format of your SFT dataset are paramount. For SFT, your data typically consists of pairs, where an “input” (or prompt) is mapped to a “target” (or completion).

Consider this example:

"prompt": "What is the capital of France?",
"completion": "The capital of France is Paris."

Or for a more complex instruction-following scenario:

"prompt": "Instruction: Summarize the following text.\nText: The quick brown fox jumps over the lazy dog. This is a common pangram.\n\nSummary:",
"completion": "The text describes the pangram 'The quick brown fox jumps over the lazy dog'."

Common formats for SFT datasets include JSONL (JSON Lines), where each line is a self-contained JSON object, making it easy to stream and process large datasets.
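Because each JSONL line is an independent JSON object, you can sanity-check a dataset without loading it all into memory. A small stdlib-only validator (the key names mirror the prompt/completion convention above; adapt them to your schema):

```python
import json

def validate_sft_jsonl(path, required_keys=("prompt", "completion")):
    """Return a list of (line_number, error) for every malformed line.

    An empty list means the file is well-formed for SFT purposes."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                errors.append((i, f"invalid JSON: {e}"))
                continue
            missing = [k for k in required_keys if k not in record]
            if missing:
                errors.append((i, f"missing keys: {missing}"))
    return errors
```

Running this before every training job catches the single truncated line that would otherwise crash a long run halfway through.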

Tunix’s Role: Efficient SFT with JAX

Tunix (Tune-in-JAX) is purpose-built to make this process efficient and scalable, especially on JAX-accelerated hardware like TPUs or powerful GPUs. It provides:

  • tunix.data.Dataset: A flexible way to load, process, and prepare your data for training. It handles tokenization, batching, and other data transformations.
  • tunix.Trainer: The core orchestrator for the fine-tuning process. It manages the training loop, optimizer, learning rate schedules, checkpointing, and evaluation.
  • JAX-native backend: Leveraging JAX’s jit compilation and pmap for distributed training means your SFT runs will be highly optimized.
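To see why the JAX backend matters, here is a toy training step compiled with `jax.jit`. The pattern — a pure step function, traced and compiled once, then called many times — is the same one a JAX training loop like Tunix's relies on (the model here is a trivial linear regression, purely for illustration):

```python
import jax
import jax.numpy as jnp

def mse_loss(params, x, y):
    pred = x @ params["w"] + params["b"]
    return jnp.mean((pred - y) ** 2)

@jax.jit  # compiled to fused device code on first call, fast thereafter
def sgd_step(params, x, y, lr=0.1):
    grads = jax.grad(mse_loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)

params = {"w": jnp.zeros((3,)), "b": jnp.zeros(())}
x = jnp.ones((8, 3))
y = jnp.ones((8,))
for _ in range(100):
    params = sgd_step(params, x, y)
print(mse_loss(params, x, y))  # converges toward 0
```

Swap the linear model for an LLM and SGD for AdamW and you have, in miniature, what happens inside each fine-tuning step.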

Let’s visualize the SFT workflow with Tunix:

graph TD
    A[Pre-trained LLM] --> B{SFT Dataset};
    B --> C[Tunix Data Processor];
    C --> D[Tunix Trainer];
    D --> E[Fine-tuned LLM];
    E --> F[Inference/Deployment];
    subgraph SFT Process
        C
        D
    end
    style A fill:#f9f,stroke:#333,stroke-width:2px
    style B fill:#bbf,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style D fill:#ddf,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style F fill:#bfb,stroke:#333,stroke-width:2px

This diagram illustrates how a pre-trained LLM is fed into the SFT process along with a specialized dataset. Tunix handles the data preparation and training, resulting in a fine-tuned model ready for specific tasks.

Step-by-Step Implementation: Your First SFT Model

It’s time to get our hands dirty! We’ll go through the process of setting up a simple SFT task using Tunix. For this example, we’ll fine-tune a small, pre-trained model to follow a specific instruction format.

Prerequisites:

  • You have a working Python environment with Tunix installed (as covered in Chapter 3).
  • You have jax, jaxlib, and transformers installed.
    • pip install "jax[cuda12]" if you are using an NVIDIA GPU (the exact extra and install command vary between JAX releases; check the official JAX installation guide for your CUDA version)
    • pip install tunix transformers sentencepiece

Step 1: Prepare Your SFT Dataset

First, we need some data! We’ll create a tiny dataset of instruction-response pairs. In a real-world scenario, this would be a much larger file.

Let’s create a Python script named run_sft.py.

# run_sft.py

import json
import os

# 1. Our simple SFT dataset
sft_data = [
    {"prompt": "Tell me a fun fact about space.", "completion": "Did you know that a day on Venus is longer than a year on Venus?"},
    {"prompt": "What is the capital of Canada?", "completion": "The capital of Canada is Ottawa."},
    {"prompt": "Explain the concept of 'hello world' in programming.", "completion": "'Hello, World!' is a simple program often used to illustrate the basic syntax of a programming language. It typically just prints the text 'Hello, World!' to the console."},
    {"prompt": "Who wrote 'Romeo and Juliet'?", "completion": "William Shakespeare wrote 'Romeo and Juliet'."}
]

# Define a filename for our dataset
dataset_filename = "simple_sft_dataset.jsonl"

# Save the dataset to a JSONL file
print(f"Saving dataset to {dataset_filename}...")
with open(dataset_filename, "w") as f:
    for entry in sft_data:
        f.write(json.dumps(entry) + "\n")
print("Dataset saved successfully.")

# We'll continue adding code to this file.

Explanation:

  • We import json and os for file operations.
  • sft_data is a Python list of dictionaries, where each dictionary represents one example. Each example has a "prompt" (the input) and a "completion" (the desired output).
  • We then iterate through this list and write each dictionary as a JSON string on a new line into simple_sft_dataset.jsonl. This is the standard JSONL format.

Step 2: Load and Process Data with Tunix

Now that we have our dataset, we need to load it and prepare it for training using Tunix’s data utilities. This involves tokenization.

Let’s extend run_sft.py:

# run_sft.py (continued)

# ... (previous code for dataset creation) ...

import jax
import jax.numpy as jnp
from transformers import AutoTokenizer, FlaxAutoModelForCausalLM
from tunix import data as tunix_data
from tunix import Trainer, TrainState
from tunix.models.flax_llm import FlaxLLMForCausalLM
from tunix.optimizers import get_optimizer
from tunix.schedules import get_lr_schedule

# 2. Load a pre-trained model and tokenizer
# We'll use a small model for demonstration purposes to keep it fast.
# Tunix integrates well with Hugging Face models.
# Note: the labels-equal-input_ids setup below assumes a *causal* (decoder-only)
# LM. Encoder-decoder models like T5 need a different model class and data
# format, so we use 'gpt2', which is small and ships with Flax weights.
model_name = "gpt2"
print(f"\nLoading model and tokenizer: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2's tokenizer has no pad token; reuse the EOS token for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# We need to instantiate a Flax model for JAX compatibility
model = FlaxAutoModelForCausalLM.from_pretrained(model_name)
# Tunix often expects its own wrapper around Flax models for certain functionalities.
# For this basic SFT, we can directly use the Flax model, but for advanced features,
# you might wrap it: llm_model = FlaxLLMForCausalLM(model)
print("Model and tokenizer loaded.")

# A simple data processing function
def tokenize_function(examples):
    # Combine prompt and completion to form the full text
    full_text = [f"{p}\n{c}{tokenizer.eos_token}" for p, c in zip(examples["prompt"], examples["completion"])]
    # Tokenize the combined text
    tokenized = tokenizer(
        full_text,
        max_length=128, # A reasonable max length for our small examples
        truncation=True,
        padding="max_length",
        return_tensors="jax", # Important for JAX compatibility
    )
    # Tunix Trainer expects 'input_ids' and 'labels'.
    # For causal LMs, the labels are the input_ids themselves; the loss
    # function shifts them by one position internally, so that position t
    # is scored on predicting token t+1.
    tokenized["labels"] = tokenized["input_ids"]
    return tokenized

# 3. Create a Tunix Dataset
print(f"\nLoading and processing dataset from {dataset_filename}...")
# Use tunix_data.load_jsonl for convenience
raw_dataset = tunix_data.load_jsonl(dataset_filename)

# Convert to a Tunix Dataset object
sft_dataset = tunix_data.Dataset(
    raw_dataset,
    tokenizer=tokenizer,
    tokenization_fn=tokenize_function,
    batch_size=2, # Small batch size for demonstration
    shuffle=True,
    drop_remainder=True,
    num_replicas=jax.device_count(), # For multi-device training if available
)
print("Dataset loaded and configured.")

Explanation:

  • We import necessary JAX, Hugging Face, and Tunix components.
  • model_name: We’re using gpt2, a small decoder-only (causal) model from Hugging Face that ships with Flax weights, perfect for quick experiments. Be aware that encoder-decoder models such as T5 are not causal LMs and would require a different model class and data format.
  • tokenizer = AutoTokenizer.from_pretrained(model_name): Loads the appropriate tokenizer for our model. GPT-2’s tokenizer has no padding token by default, so we reuse its EOS token for padding.
  • model = FlaxAutoModelForCausalLM.from_pretrained(model_name): Loads the pre-trained Flax model weights.
  • tokenize_function(examples): This is a crucial function.
    • It takes a batch of raw examples (dictionaries with “prompt” and “completion”).
    • It combines the prompt and completion into a single string, adding an eos_token (end-of-sequence) at the end of the completion. This teaches the model when to stop generating.
    • It then uses the tokenizer to convert these strings into numerical input_ids, padding them to max_length and truncating if longer. return_tensors="jax" ensures the output is JAX arrays.
    • Crucially, for causal language modeling, the labels for training are typically the same as the input_ids. The model learns to predict the next token given the previous ones.
  • raw_dataset = tunix_data.load_jsonl(dataset_filename): Tunix provides a helper to load JSONL files.
  • sft_dataset = tunix_data.Dataset(...): This initializes Tunix’s data pipeline.
    • raw_dataset: Our loaded data.
    • tokenizer: The tokenizer we loaded.
    • tokenization_fn: Our custom function to process the raw data into token IDs.
    • batch_size: How many examples to process at once. Keep this small for initial tests.
    • num_replicas: This is important for JAX; it tells Tunix how many accelerators (GPUs/TPUs) are available to distribute the batch across. jax.device_count() automatically detects this.
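One refinement worth knowing: with padding="max_length", the pad positions also end up in labels, and you usually want to exclude them from the loss. A minimal sketch (the -100 "ignore index" convention comes from Hugging Face/PyTorch losses; check what your trainer actually expects):

```python
import jax.numpy as jnp

def mask_pad_labels(input_ids, attention_mask):
    """Copy input_ids into labels, but mark padded positions with -100
    so a loss function using the ignore-index convention skips them."""
    return jnp.where(attention_mask == 1, input_ids, -100)

input_ids      = jnp.array([[5, 6, 7, 0, 0]])   # 0 = pad token id here
attention_mask = jnp.array([[1, 1, 1, 0, 0]])
print(mask_pad_labels(input_ids, attention_mask))  # [[5 6 7 -100 -100]]
```

Without masking, the model is also trained to emit pad tokens, which wastes capacity and can distort the reported loss on short examples.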

Step 3: Configure Tunix Trainer

The Trainer is the heart of the fine-tuning process. It brings together the model, data, optimizer, and learning rate schedule.

Continue editing run_sft.py:

# run_sft.py (continued)

# ... (previous code for data loading and processing) ...

# 4. Configure Tunix Trainer
print("\nConfiguring Tunix Trainer...")

# Define the optimizer
# Tunix provides `get_optimizer` for common optimizers like AdamW
optimizer = get_optimizer(
    name="adamw",
    learning_rate=get_lr_schedule(
        name="constant",
        initial_learning_rate=1e-4, # A common starting point for SFT
    ),
    weight_decay=0.01,
)

# Initialize the TrainState
# This holds the model parameters, optimizer state, and other training metadata.
# A dummy batch lets the trainer sanity-check input shapes up front. Its
# sequence length must match the max_length used in tokenize_function.
max_length = 128  # keep in sync with tokenize_function
dummy_input = {
    "input_ids": jnp.zeros((sft_dataset.batch_size, max_length), dtype=jnp.int32),
    "attention_mask": jnp.ones((sft_dataset.batch_size, max_length), dtype=jnp.int32),
    "labels": jnp.zeros((sft_dataset.batch_size, max_length), dtype=jnp.int32),
}
# Replicate the dummy batch across devices (adds a leading device axis).
# Note: jax.tree_map is deprecated; use jax.tree_util.tree_map instead.
dummy_input = jax.tree_util.tree_map(
    lambda x: jnp.stack([x] * jax.device_count()), dummy_input
)


# Tunix's TrainState requires a specific model wrapper or direct Flax model.
# Let's use FlaxLLMForCausalLM for full Tunix compatibility
llm_model = FlaxLLMForCausalLM(model)

# Initialize the TrainState with the model and optimizer
train_state = TrainState.create(
    apply_fn=llm_model.__call__,
    params=llm_model.params,
    tx=optimizer,
    # Need to pass a dummy input for parameter initialization
    **dummy_input
)

# Initialize the Trainer
trainer = Trainer(
    model=llm_model,
    train_dataset=sft_dataset,
    eval_dataset=None, # We're skipping evaluation for this simple example
    train_state=train_state,
    epochs=3, # Train for 3 epochs - a small number for quick demonstration
    max_steps_per_epoch=None, # Train on all data per epoch
    log_steps=1, # Log every step
    output_dir="./sft_output", # Directory to save checkpoints and logs
)
print("Trainer configured.")

Explanation:

  • Optimizer: We define an AdamW optimizer, a popular choice for deep learning, with a constant learning rate of 1e-4. Tunix provides get_optimizer and get_lr_schedule helpers.
  • TrainState: This is a JAX/Flax concept that holds the mutable state of your training process, including the model’s parameters and the optimizer’s state (e.g., momentum buffers).
    • We create dummy_input so the trainer can sanity-check input shapes up front; the parameters themselves already come initialized from the pretrained checkpoint. We also replicate it across devices for multi-device training.
    • llm_model = FlaxLLMForCausalLM(model): We wrap our Hugging Face Flax model with Tunix’s FlaxLLMForCausalLM. This wrapper ensures compatibility with Tunix’s training loop and methods.
    • TrainState.create(...): Initializes the TrainState with the model’s apply_fn (how the model processes inputs), initial parameters, and the optimizer (tx).
  • Trainer: We instantiate the Trainer with:
    • model: Our Tunix-wrapped LLM.
    • train_dataset: Our sft_dataset created earlier.
    • epochs: How many times to iterate over the entire dataset. 3 is very small, but good for a first run.
    • log_steps: How frequently to log training progress.
    • output_dir: Where checkpoints and logs will be saved.
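A constant learning rate is the simplest choice, but warmup followed by cosine decay is a common upgrade for SFT. To build intuition for what such a schedule computes, here is a hand-rolled, stdlib-only version (the hyperparameters are illustrative; in practice you would reach for a library schedule such as optax.warmup_cosine_decay_schedule rather than write your own):

```python
import math

def warmup_cosine(step, peak_lr=1e-4, warmup_steps=100, total_steps=1000):
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(warmup_cosine(0))     # 0.0 — training starts gently
print(warmup_cosine(100))   # 1e-4 — the peak, at the end of warmup
print(warmup_cosine(1000))  # ~0.0 — fully decayed at the last step
```

Warmup avoids large, destabilizing updates while the optimizer state is still cold; the decay lets the model settle into a minimum near the end of training.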

Step 4: Run the Fine-Tuning

With everything configured, starting the training is a single line of code!

Add this to run_sft.py:

# run_sft.py (continued)

# ... (previous code for Trainer configuration) ...

# 5. Run the fine-tuning
print("\nStarting Supervised Fine-Tuning...")
trainer.train()
print("Fine-tuning complete!")

# 6. Save the fine-tuned model
output_model_path = os.path.join(trainer.output_dir, "final_sft_model")
print(f"\nSaving fine-tuned model to {output_model_path}...")
trainer.save_model(output_model_path)
print("Model saved.")

Explanation:

  • trainer.train(): This kicks off the entire training process. You’ll see logs printed to your console showing the loss decreasing (hopefully!).
  • trainer.save_model(output_model_path): After training, we save the learned model weights to a specified directory. This allows us to load and use it later.

Step 5: Inference with the Fine-Tuned Model

Now for the exciting part: let’s see if our model learned anything! We’ll load the fine-tuned model and try to generate responses.

Append to run_sft.py:

# run_sft.py (continued)

# ... (previous code for saving model) ...

# 7. Perform inference with the fine-tuned model
print("\nPerforming inference with the fine-tuned model...")

# Load the fine-tuned model
# We need to reload the Flax model and then wrap it with FlaxLLMForCausalLM
fine_tuned_model = FlaxAutoModelForCausalLM.from_pretrained(output_model_path)
fine_tuned_llm = FlaxLLMForCausalLM(fine_tuned_model)

# Function to generate a response
def generate_response(prompt_text, max_new_tokens=50):
    # Prepare the prompt for the model
    input_ids = tokenizer(prompt_text, return_tensors="jax").input_ids
    
    # Generate tokens
    # Note: Tunix's FlaxLLMForCausalLM might have a specific generate method,
    # or you can use the underlying Hugging Face model's generate method.
    # For simplicity, we'll use the Hugging Face model's method here.
    output_ids = fine_tuned_model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        do_sample=True, # Use sampling for more varied outputs
        temperature=0.7, # Controls randomness
        top_k=50, # Limits sampling to the top K most likely tokens
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    # Flax's generate() returns an output object rather than a raw array;
    # the generated token IDs live in its `.sequences` field.
    response = tokenizer.decode(output_ids.sequences[0], skip_special_tokens=True)
    
    # Post-process to remove the original prompt if it's included in the response
    # This might depend on how the model was trained and how the prompt was formatted.
    if prompt_text in response:
        return response[len(prompt_text):].strip()
    return response.strip()

# Test prompts
test_prompts = [
    "Tell me a fun fact about the ocean.",
    "What is the capital of Canada?",
    "Explain recursion in programming.",
    "Who discovered gravity?",
]

for prompt in test_prompts:
    print(f"\nPrompt: {prompt}")
    response = generate_response(prompt)
    print(f"Response: {response}")

print("\nInference complete.")

Explanation:

  • We reload the model from the output_model_path to ensure we’re using our fine-tuned weights.
  • generate_response function:
    • Takes a prompt_text.
    • Tokenizes the prompt.
    • Uses the fine_tuned_model.generate() method (from Hugging Face Transformers) to produce new tokens. We’re using do_sample=True and temperature, top_k for more creative and less deterministic outputs.
    • Decodes the generated output_ids back into human-readable text.
    • Includes a simple post-processing step to remove the original prompt from the response, which is often desirable.
  • We then test our fine-tuned model with a few prompts, including one that was in our training data (“What is the capital of Canada?”) and some new ones.
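To build intuition for what the temperature and top_k arguments actually do, here is a single decoding step over a raw logits vector, sketched in plain JAX (a real generate loop repeats this once per new token; the toy logits are illustrative):

```python
import jax
import jax.numpy as jnp

def sample_next_token(logits, key, temperature=0.7, top_k=50):
    """One decoding step: keep the top_k logits, rescale by temperature, sample."""
    k = min(top_k, logits.shape[-1])
    top_vals, top_idx = jax.lax.top_k(logits, k)   # prune to the k best tokens
    # Dividing by temperature < 1 sharpens the distribution (more greedy);
    # temperature > 1 flattens it (more random).
    choice = jax.random.categorical(key, top_vals / temperature)
    return top_idx[choice]

vocab_logits = jnp.array([2.0, 1.0, 0.5, -1.0, -3.0])  # a 5-token toy vocabulary
key = jax.random.PRNGKey(0)
token = sample_next_token(vocab_logits, key, temperature=0.7, top_k=3)
# `token` is always one of the 3 highest-logit indices: 0, 1, or 2
```

Greedy decoding is the degenerate case: top_k=1, where the argmax token is always chosen and the output becomes deterministic.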

To run the full script, simply execute: python run_sft.py

You should observe the model providing more accurate or specific responses to the questions it was fine-tuned on, and potentially improved instruction following for similar new prompts.

Mini-Challenge: Tweak and Observe!

Great job completing your first SFT run! Now, let’s play around a bit to build intuition.

Challenge:

  1. Add two more unique prompt-completion pairs to your sft_data list in run_sft.py. Make them very specific, perhaps about a fictional character or a niche topic.
  2. Change the epochs parameter in the Trainer from 3 to 5.
  3. Re-run the script.
  4. Observe the new responses for your added prompts and for the existing test prompts. Did the model’s behavior change? Is it more accurate on the new specific facts?

Hint: Pay close attention to the loss values during training. Does it continue to decrease? Does the model seem to “memorize” the new facts better?

What to observe/learn:

  • How even a small increase in training data and epochs can influence a model’s ability to recall specific facts or follow instructions.
  • The trade-off between training time and model performance.
  • The general trend of loss during training (it should generally decrease, indicating learning).

Common Pitfalls & Troubleshooting

Fine-tuning can sometimes be tricky. Here are a few common issues you might encounter and how to approach them:

  1. “Out of Memory” (OOM) Errors:

    • Symptom: Your script crashes with a message like CUDA out of memory or ResourceExhaustedError.
    • Cause: The model, batch size, or max_length is too large for your GPU/TPU’s memory.
    • Solution:
      • Reduce batch_size: This is often the first step. Try 1 or 2.
      • Reduce max_length: If your sequences are very long, shortening max_length in tokenize_function can help.
      • Use a smaller model: If you’re using a massive LLM, consider a smaller variant for initial experiments.
      • Gradient Accumulation: For very large models, Tunix supports gradient accumulation (processing batches sequentially but updating weights less frequently) which can simulate larger batch sizes with less memory. (This is an advanced feature not covered in this chapter, but good to know).
  2. Model Not Learning / Underfitting:

    • Symptom: The training loss doesn’t decrease significantly, or the model’s responses are still generic after fine-tuning.
    • Cause:
      • Insufficient Data: Your SFT dataset might be too small or not diverse enough for the task.
      • Too Few Epochs: The model hasn’t had enough time to learn.
      • Learning Rate Too Low: The model’s updates are too small to make meaningful progress.
    • Solution:
      • Increase epochs: Give the model more training time.
      • Increase learning_rate: Experiment with values like 5e-5 or 1e-5.
      • Improve Dataset Quality/Quantity: This is often the most impactful solution. More high-quality, relevant data is key.
  3. Overfitting:

    • Symptom: The training loss goes very low, but the model performs poorly on new, unseen data (it just “memorizes” the training examples).
    • Cause:
      • Too Many Epochs: The model has learned the training data too well, including its noise.
      • Learning Rate Too High: Updates are too aggressive.
      • Small Dataset: Easy for the model to memorize a tiny dataset.
    • Solution:
      • Reduce epochs: Stop training earlier.
      • Reduce learning_rate: Make updates smaller.
      • Add Regularization: Techniques like weight decay (already included in our AdamW optimizer) help prevent overfitting. Tunix’s Trainer might offer more advanced regularization options.
      • Use an eval_dataset: This is crucial. If you provide an eval_dataset to the Trainer, it can track validation loss, allowing you to stop training when validation loss starts to increase (early stopping).
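The early-stopping idea in the last bullet is easy to implement yourself if your trainer doesn't provide it. A stdlib-only sketch of patience-based early stopping, with a list of per-epoch validation losses standing in for real evaluation calls:

```python
def early_stop_epoch(eval_losses, patience=2):
    """Return the epoch at which to stop: when validation loss has failed
    to improve `patience` epochs in a row, overfitting has likely begun."""
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(eval_losses):
        if loss < best:
            best, bad_epochs = loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(eval_losses) - 1  # never triggered: train to the end

# Validation loss improves for 3 epochs, then climbs: stop at epoch 4.
print(early_stop_epoch([1.0, 0.8, 0.7, 0.75, 0.9]))  # 4
```

In practice you would also checkpoint the parameters at each new best epoch, so the model you keep is the one from before validation loss turned upward.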

Summary

Congratulations on completing your first Supervised Fine-Tuning with Tunix! You’ve taken a significant step in understanding how to adapt powerful LLMs to your specific needs.

Here are the key takeaways from this chapter:

  • SFT’s Purpose: It’s the process of teaching a pre-trained LLM specific skills or behaviors using labeled input-output examples.
  • Dataset Importance: A well-structured dataset (often in JSONL format with prompt and completion pairs) is crucial for effective SFT.
  • Tunix Data Pipeline: tunix.data.Dataset and custom tokenization functions streamline data preparation, including tokenization, batching, and device replication.
  • Tunix Trainer: The tunix.Trainer orchestrates the entire fine-tuning process, managing the model, optimizer, learning rate, and training loop.
  • Practical Application: You’ve successfully prepared data, configured a trainer, run a fine-tuning job, and performed inference with your newly specialized LLM.

In the next chapter, we’ll delve deeper into more advanced fine-tuning techniques beyond basic SFT, exploring how to further align models with human preferences and complex instructions using methods like Reinforcement Learning from Human Feedback (RLHF). Get ready for more exciting challenges!
