Introduction
Welcome to Chapter 11! So far, you’ve mastered the fundamentals of setting up Tunix, loading models, and initiating basic post-training runs. But what if the standard tools aren’t quite enough for your specific research or application? What if you need to guide your Large Language Model (LLM) with a unique objective, fine-tune its learning process with a specialized algorithm, or automate complex actions during training?
This chapter is your gateway to unlocking the full power of Tunix customization. We’ll dive deep into how you can define and integrate your own loss functions to precisely shape your LLM’s learning objective, craft sophisticated optimizers using JAX’s powerful Optax library to control parameter updates, and implement intelligent callbacks to monitor, control, and react to your training process. By the end of this chapter, you’ll be able to tailor Tunix to virtually any LLM post-training scenario, moving beyond off-the-shelf solutions to truly bespoke training pipelines.
To get the most out of this chapter, you should be comfortable with the basic Tunix training loop concepts covered in previous chapters, have a foundational understanding of JAX, and be familiar with common machine learning concepts like loss functions and optimizers. Let’s get started on making Tunix truly yours!
Core Concepts: The Pillars of Customization
Tunix, being built on JAX, inherits JAX’s flexibility and composability. This “white-box” design, as Google describes it, means you have granular control over every aspect of your training. We’ll focus on three key areas: loss functions, optimizers, and callbacks.
Understanding Loss Functions in LLM Post-Training
At its heart, machine learning is about minimizing errors. A loss function is the mathematical engine that quantifies this error. It measures how “wrong” your model’s predictions are compared to the true targets. During training, the optimizer uses the gradient of this loss to adjust the model’s parameters, iteratively making the model “less wrong.”
What is a Loss Function?
Imagine you’re teaching a child to identify fruits. If they say “apple” when shown an orange, that’s an error. The loss function is like a scorekeeper that gives a higher penalty for bigger mistakes (e.g., saying “rock” for an orange) and a smaller penalty for minor ones (e.g., saying “tangerine” for an orange).
For LLMs, common tasks like next-token prediction often use Cross-Entropy Loss. However, in post-training, especially for alignment techniques like Reinforcement Learning from Human Feedback (RLHF), you might encounter more complex losses like Kullback-Leibler (KL) Divergence to penalize divergence from a reference model, or custom losses designed for specific safety or factual consistency objectives.
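To make the KL penalty concrete, here is a minimal sketch of a per-token KL divergence between a policy model’s logits and a frozen reference model’s logits. The function name and shapes are our own illustration, not a Tunix API:

```python
import jax
import jax.numpy as jnp

def token_kl_penalty(policy_logits: jnp.ndarray, ref_logits: jnp.ndarray) -> jnp.ndarray:
    """Mean KL(policy || reference) over all token positions.

    Both inputs have shape (batch, seq_len, vocab_size).
    """
    policy_logp = jax.nn.log_softmax(policy_logits, axis=-1)
    ref_logp = jax.nn.log_softmax(ref_logits, axis=-1)
    # KL(p || q) = sum_v p(v) * (log p(v) - log q(v)), summed over the vocabulary
    kl = jnp.sum(jnp.exp(policy_logp) * (policy_logp - ref_logp), axis=-1)
    return jnp.mean(kl)

# Identical distributions give zero divergence; any mismatch gives a positive penalty.
same = token_kl_penalty(jnp.zeros((2, 3, 5)), jnp.zeros((2, 3, 5)))
```

In RLHF-style training, this term is typically scaled by a coefficient and added to the main objective to keep the policy close to the reference model.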
Why Customize Loss Functions?
- Task Specificity: Default losses might not perfectly align with your unique post-training goal. For example, you might want to penalize certain types of hallucinations more heavily.
- Robustness: Custom losses can be designed to be more robust to noisy data or outliers.
- Regularization: You can add regularization terms directly into your loss function to prevent overfitting or encourage desired model properties.
- Multi-objective Optimization: Combine multiple objectives (e.g., fluency, coherence, safety) into a single composite loss.
How Tunix Integrates Custom Losses
Tunix, like many JAX-based libraries, expects loss functions to be pure Python functions that operate on JAX arrays. The key is that they must be differentiable, as JAX will automatically compute their gradients.
Figure 11.1: Simplified data flow for loss function and gradient computation.
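As a minimal illustration of this contract, any pure function of JAX arrays can be handed to jax.value_and_grad, and JAX derives the gradient for you. This toy mean-squared-error example is unrelated to any Tunix API:

```python
import jax
import jax.numpy as jnp

def toy_loss(params: jnp.ndarray, targets: jnp.ndarray) -> jnp.ndarray:
    # A pure, differentiable function of JAX arrays: mean squared error.
    return jnp.mean((params - targets) ** 2)

targets = jnp.array([1.0, 2.0, 3.0])
params = jnp.zeros(3)

# JAX computes the gradient automatically; no manual derivative code needed.
loss, grads = jax.value_and_grad(toy_loss)(params, targets)
# Analytically, grads == 2 * (params - targets) / 3.
```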
Mastering Optimizers with Optax
An optimizer is the algorithm that adjusts your model’s internal parameters (weights and biases) based on the gradients computed from the loss function. It’s the “how” of learning.
What is an Optimizer?
If the loss function tells you how far off you are, the optimizer tells you which way to go and how big a step to take to reduce that error. Imagine you’re blindfolded on a mountain, trying to find the lowest point. The loss function tells you your current elevation. The optimizer tells you which direction is downhill and how far to step based on the steepness.
Common optimizers include Stochastic Gradient Descent (SGD), Adam, and AdamW. Optax is JAX’s library for gradient processing and optimization, providing a highly modular and composable way to build custom optimizers.
Why Customize Optimizers?
- Learning Rate Schedules: Dynamically change the learning rate over time (e.g., warm-up, decay) for more stable and effective training.
- Advanced Algorithms: Use specialized optimizers or combine multiple optimization techniques (e.g., gradient clipping + AdamW).
- Memory Efficiency: Some optimizers are more memory-efficient for very large models.
- Hyperparameter Tuning: Fine-tune optimizer behavior for optimal performance on your specific task.
How Tunix Leverages Optax
Tunix integrates seamlessly with Optax. You define your optimizer using Optax’s building blocks, and Tunix uses this Optax optimizer state to manage parameter updates within its training loop. This allows for immense flexibility without rewriting core training logic.
Enhancing Training with Callbacks
Callbacks are functions or objects that can be executed at specific points during the training process. They allow you to inject custom logic without modifying the core training loop.
What are Callbacks?
Think of callbacks as event listeners for your training process. When a certain event happens (e.g., an epoch ends, a batch finishes, training starts), your callback can “listen” for that event and perform a predefined action.
Why Use Callbacks?
- Logging: Record metrics, gradients, or other data to a file or a visualization tool (e.g., TensorBoard, Weights & Biases).
- Early Stopping: Automatically stop training if the model’s performance on a validation set stops improving, preventing overfitting and saving compute.
- Model Checkpointing: Save the model’s weights at regular intervals or when a new best performance is achieved.
- Learning Rate Scheduling: Adjust the learning rate based on validation metrics.
- Custom Metrics: Compute and log metrics not natively handled by the training loop.
- Dynamic Adjustments: Modify training parameters or even model architecture mid-training (though this is more advanced).
Tunix’s Callback System
Tunix provides a flexible callback system, allowing you to define classes with methods that correspond to various lifecycle events of the training process (e.g., on_train_begin, on_step_end, on_epoch_end).
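The mechanism is simple: the trainer holds a list of callback objects and invokes the matching method at each lifecycle event. A language-level sketch (the class and method names here are illustrative, not a confirmed Tunix API):

```python
from typing import Any, Dict

class BaseCallback:
    """Minimal callback base class: every hook is a no-op unless overridden."""
    def on_train_begin(self, state: Any, **kwargs: Any) -> None: pass
    def on_step_end(self, state: Any, **kwargs: Any) -> None: pass
    def on_epoch_end(self, state: Any, logs: Dict[str, Any], **kwargs: Any) -> None: pass

class StepCounterCallback(BaseCallback):
    def __init__(self) -> None:
        self.steps_seen = 0
    def on_step_end(self, state: Any, **kwargs: Any) -> None:
        self.steps_seen += 1

# The training loop just fires each hook at the matching event:
callbacks = [StepCounterCallback()]
for cb in callbacks:
    cb.on_train_begin(state=None)
for step in range(3):
    for cb in callbacks:
        cb.on_step_end(state=None)
```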
Step-by-Step Implementation
Let’s put these concepts into practice. We’ll assume you have a basic Tunix setup with a model and dataset ready, similar to what you’d have from Chapter 3 or 4. For demonstration, we’ll use a simplified training loop structure.
Prerequisites: Tunix and JAX Setup (as of 2026-01-30)
First, ensure you have Tunix and its dependencies installed.
# It's always a good idea to use a virtual environment
python -m venv tunix_env
source tunix_env/bin/activate # On Windows: .\tunix_env\Scripts\activate
# Install Tunix from its official GitHub repository for the latest stable version
# As of 2026-01-30, we'll assume a stable release like v0.2.0 or newer.
# Always check the official Tunix GitHub for the absolute latest stable release.
pip install "tunix[full] @ git+https://github.com/google/[email protected]"
# Verify JAX, Flax, Optax versions (these will be installed as Tunix dependencies)
# JAX: ~0.4.23 or newer
# Optax: ~0.1.7 or newer
# Flax: ~0.7.5 or newer
pip show jax flax optax
Note: The Tunix version v0.2.0 is a placeholder for a stable release by early 2026. Always refer to the official Tunix GitHub releases page for the most current stable tag or branch.
1. Defining a Custom Loss Function
Let’s create a custom loss function that modifies standard cross-entropy by adding a small L2 regularization term on the model’s weights directly within the loss calculation. (L2 is typically handled by the optimizer or a separate regularization term; here it illustrates how to build a custom composite loss.)
We’ll define a function that takes logits, labels, and model params as input.
import jax
import jax.numpy as jnp
import optax
import flax  # Needed for the flax.core.FrozenDict type annotation below
import flax.linen as nn
from tunix.trainer import Trainer  # Assuming Tunix Trainer structure
from tunix.models import Transformer  # Example Tunix model
# --- 1. Define a Custom Loss Function ---
def custom_llm_loss(logits: jnp.ndarray, labels: jnp.ndarray, params: flax.core.FrozenDict, l2_reg_factor: float = 1e-4) -> jnp.ndarray:
    """
    Computes a custom loss for LLM post-training, combining cross-entropy
    with L2 regularization on model parameters.

    Args:
        logits: The model's output logits (raw predictions).
        labels: The true target labels (e.g., next token IDs).
        params: The model's parameters, used for L2 regularization.
        l2_reg_factor: The strength of the L2 regularization.

    Returns:
        A scalar JAX array representing the total loss.
    """
    # Standard cross-entropy loss. We assume labels are integer token IDs and
    # logits span the full vocabulary; Optax accepts the integer labels
    # directly, so no explicit one-hot encoding is needed.
    ce_loss = optax.softmax_cross_entropy_with_integer_labels(logits=logits, labels=labels).mean()

    # L2 regularization over every array leaf of the parameter tree
    # (weights and biases).
    l2_loss = 0.0
    for leaf in jax.tree_util.tree_leaves(params):
        l2_loss += jnp.sum(leaf**2)

    total_loss = ce_loss + l2_reg_factor * l2_loss
    return total_loss
Explanation:
- We import jax, jax.numpy, optax, and flax.linen, as these are fundamental for JAX-native operations.
- custom_llm_loss takes logits (model predictions), labels (true values), and params (model weights for regularization) as input.
- optax.softmax_cross_entropy_with_integer_labels is a robust way to compute cross-entropy loss in JAX. It accepts integer labels directly, so no explicit one-hot encoding (jax.nn.one_hot) is required. We take its mean across the batch.
- L2 Regularization: We iterate through the params tree (a nested structure of model weights). For each array leaf, we compute the sum of its squares and add it to l2_loss.
- Finally, we combine ce_loss and l2_loss using l2_reg_factor to get total_loss.
2. Implementing a Custom Optimizer with Optax
Now, let’s build a custom optimizer using Optax. We’ll combine AdamW with a linear warm-up followed by a cosine decay learning rate schedule.
# --- 2. Implement a Custom Optimizer with Optax ---
def create_custom_optimizer(
    learning_rate: float,
    total_steps: int,
    warmup_steps: int,
    weight_decay: float = 1e-1
) -> optax.GradientTransformation:
    """
    Creates a custom Optax optimizer with AdamW and a combined learning rate schedule.

    Args:
        learning_rate: The peak learning rate.
        total_steps: Total number of training steps.
        warmup_steps: Number of steps for linear warm-up.
        weight_decay: Decoupled weight decay strength for AdamW.

    Returns:
        An optax.GradientTransformation object.
    """
    # 2.1. Define the learning rate schedule.
    # Linear warm-up
    warmup_fn = optax.linear_schedule(
        init_value=0.0,
        end_value=learning_rate,
        transition_steps=warmup_steps
    )
    # Cosine decay after warm-up
    decay_fn = optax.cosine_decay_schedule(
        init_value=learning_rate,
        decay_steps=total_steps - warmup_steps
    )
    # Combine the schedules. The `boundaries` argument to `join_schedules`
    # marks the step at which we switch from warm-up to decay.
    lr_schedule = optax.join_schedules(
        schedules=[warmup_fn, decay_fn],
        boundaries=[warmup_steps]
    )

    # 2.2. Define the optimizer chain.
    # optax.chain composes multiple gradient transformations.
    optimizer = optax.chain(
        optax.clip_by_global_norm(1.0),  # Gradient clipping to prevent exploding gradients
        optax.adamw(learning_rate=lr_schedule, weight_decay=weight_decay)  # AdamW with our schedule
    )
    return optimizer
# Example usage (you'd pass these to your Tunix Trainer)
# peak_lr = 1e-4
# total_training_steps = 10000
# num_warmup_steps = 1000
# custom_optim = create_custom_optimizer(peak_lr, total_training_steps, num_warmup_steps)
Explanation:
- create_custom_optimizer takes parameters like learning_rate, total_steps, warmup_steps, and weight_decay.
- Learning Rate Schedule:
  - optax.linear_schedule creates a schedule that linearly increases the learning rate from 0.0 to learning_rate over warmup_steps.
  - optax.cosine_decay_schedule creates a schedule that decays the learning rate from learning_rate to a small value using a cosine function over the remaining steps.
  - optax.join_schedules combines these two, switching from the warm-up to the decay schedule at warmup_steps.
- Optimizer Chain:
  - optax.chain allows you to compose multiple gradient transformations.
  - optax.clip_by_global_norm(1.0) is a common practice for LLMs to prevent gradients from becoming too large, which can destabilize training.
  - optax.adamw is a popular optimizer. We pass our lr_schedule and weight_decay to it.
3. Creating and Using a Custom Callback
Let’s define a custom callback that logs the average loss every N steps and saves a checkpoint if the validation loss improves.
import os
import time
from typing import Any, Dict, Optional

from tunix.trainer import TrainerCallback, TrainerState  # Assuming Tunix provides these base classes

# --- 3. Creating and Using a Custom Callback ---
class CustomLoggerAndCheckpointCallback(TrainerCallback):
    """
    A custom callback to log average loss periodically and save model checkpoints
    based on improved validation loss.
    """
    def __init__(self, log_interval_steps: int, checkpoint_dir: str = "./checkpoints",
                 monitor_metric: str = "val_loss", mode: str = "min"):
        super().__init__()
        self.log_interval_steps = log_interval_steps
        self.checkpoint_dir = checkpoint_dir
        self.monitor_metric = monitor_metric
        self.mode = mode
        self.best_metric_value = None
        self.step_losses = []
        os.makedirs(self.checkpoint_dir, exist_ok=True)
        print(f"CustomLoggerAndCheckpointCallback initialized. Checkpoints will be saved to: {self.checkpoint_dir}")

    def on_train_begin(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the beginning of training."""
        print("Training started! Initializing custom callback.")
        self.best_metric_value = float('inf') if self.mode == 'min' else -float('inf')

    def on_step_end(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the end of each training step."""
        # Assuming Tunix TrainerState has 'current_step' and 'loss_value'
        current_step = state.current_step
        current_loss = state.loss_value
        self.step_losses.append(current_loss)
        if (current_step + 1) % self.log_interval_steps == 0:
            avg_loss = jnp.mean(jnp.array(self.step_losses)).item()
            print(f"Step {current_step + 1}/{state.total_steps} - Average Loss ({self.log_interval_steps} steps): {avg_loss:.4f}")
            self.step_losses = []  # Reset for the next interval

    def on_epoch_end(self, state: TrainerState, logs: Dict[str, Any], **kwargs: Any) -> None:
        """Called at the end of each epoch."""
        # Check if the monitored metric is available in logs
        if self.monitor_metric in logs:
            current_metric_value = logs[self.monitor_metric]
            print(f"Epoch {state.current_epoch} - {self.monitor_metric}: {current_metric_value:.4f}")
            should_save = False
            if self.mode == 'min':
                if current_metric_value < self.best_metric_value:
                    self.best_metric_value = current_metric_value
                    should_save = True
            elif self.mode == 'max':
                if current_metric_value > self.best_metric_value:
                    self.best_metric_value = current_metric_value
                    should_save = True
            if should_save:
                checkpoint_path = os.path.join(
                    self.checkpoint_dir,
                    f"model_epoch_{state.current_epoch:03d}_{self.monitor_metric}_{self.best_metric_value:.4f}.tunix"
                )
                # The exact saving call is hypothetical and depends on the Tunix API.
                # You would typically call a method on the Trainer or pass
                # state.params to a saving utility, e.g.:
                # state.trainer.save_checkpoint(state.params, state.opt_state, checkpoint_path)
                print(f"Would save best model checkpoint to {checkpoint_path}")
        else:
            print(f"Warning: Monitored metric '{self.monitor_metric}' not found in logs for epoch {state.current_epoch}.")

    def on_train_end(self, state: TrainerState, **kwargs: Any) -> None:
        """Called at the end of training."""
        print("Training ended! Custom callback finished.")

# --- Integration Example ---
# Assuming you have a Tunix Trainer instance setup
# trainer = Trainer(...)
#
# # Instantiate your custom callback
# my_callback = CustomLoggerAndCheckpointCallback(
#     log_interval_steps=50,
#     checkpoint_dir="./my_llm_checkpoints",
#     monitor_metric="val_loss",  # This metric needs to be computed and logged by the Tunix Trainer
#     mode="min"
# )
#
# # Add the callback to your Tunix Trainer
# trainer.add_callback(my_callback)
#
# # Then you would call trainer.train(...)
Explanation:
- We define CustomLoggerAndCheckpointCallback, which inherits from tunix.trainer.TrainerCallback.
- __init__: Sets up the logging interval, checkpoint directory, and the metric to monitor.
- on_train_begin: Prints a message indicating the start of training and initializes best_metric_value.
- on_step_end: Called after each training step. We collect the loss for the current step and, once log_interval_steps have passed, calculate and print the average loss.
- on_epoch_end: Called at the end of each epoch. It checks the monitor_metric in the logs dictionary provided by the Tunix Trainer. If the metric shows improvement (e.g., val_loss decreases), it prints a message indicating a checkpoint would be saved.
  - Note: The actual save_model or save_checkpoint call is hypothetical and depends on the exact Tunix API for saving model states. You would typically call a method on the trainer object itself or use a utility function provided by Tunix.
- on_train_end: A final message when training concludes.
Integrating Custom Components into Tunix
Now, let’s imagine a simplified Tunix Trainer setup to see how these custom components would be plugged in.
# Assuming you have a model, dataset, and basic Tunix setup.
# For this example, we'll use placeholder classes.
import functools

# Placeholder for a Tunix-compatible model
class MyTunixModel(nn.Module):
    num_heads: int = 8
    num_layers: int = 4
    vocab_size: int = 1000
    embed_dim: int = 256

    @nn.compact
    def __call__(self, x: jnp.ndarray, train: bool = True):
        # Simplified transformer block for demonstration
        x = nn.Embed(num_embeddings=self.vocab_size, features=self.embed_dim)(x)
        for _ in range(self.num_layers):
            x = nn.SelfAttention(num_heads=self.num_heads)(x)
            x = nn.Dense(features=self.embed_dim)(x)
            x = nn.LayerNorm()(x)
        logits = nn.Dense(features=self.vocab_size)(x)
        return logits

# Placeholder for a dataset (e.g., a simple iterator)
class DummyDataset:
    def __init__(self, num_batches: int = 100, batch_size: int = 4, seq_len: int = 64, vocab_size: int = 1000):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.seq_len = seq_len
        self.vocab_size = vocab_size

    def __iter__(self):
        for _ in range(self.num_batches):
            # Simulate input tokens and target labels (fixed keys, so every
            # batch is identical -- fine for a demo, not for real training)
            inputs = jax.random.randint(jax.random.PRNGKey(0), (self.batch_size, self.seq_len), 0, self.vocab_size)
            labels = jax.random.randint(jax.random.PRNGKey(1), (self.batch_size, self.seq_len), 0, self.vocab_size)
            yield {"input_ids": inputs, "labels": labels}

    def __len__(self):
        return self.num_batches

# Initialize a JAX PRNGKey
key = jax.random.PRNGKey(42)
model_key, dropout_key = jax.random.split(key)

# Instantiate your custom components
peak_lr = 1e-4
total_training_steps = 1000  # For a short demo
num_warmup_steps = 100
custom_optim = create_custom_optimizer(peak_lr, total_training_steps, num_warmup_steps)
my_callback = CustomLoggerAndCheckpointCallback(
    log_interval_steps=10,
    checkpoint_dir="./my_llm_checkpoints_demo",
    monitor_metric="val_loss",  # This metric needs to be computed and logged by the Tunix Trainer
    mode="min"
)

# Initialize model and parameters
dummy_input = jnp.ones((1, 64), dtype=jnp.int32)  # batch_size=1, seq_len=64
model = MyTunixModel()
params = model.init(model_key, dummy_input)['params']  # Initialize only 'params'

# Create a dummy Tunix Trainer (this is a simplified representation)
class SimplifiedTunixTrainer:
    def __init__(self, model_module: nn.Module, params: flax.core.FrozenDict,
                 optimizer: optax.GradientTransformation, loss_fn: callable,
                 train_dataset: Any, val_dataset: Any = None, callbacks: Optional[list] = None):
        self.model_module = model_module
        self.params = params
        self.optimizer = optimizer
        self.opt_state = optimizer.init(params)
        self.loss_fn = loss_fn
        self.train_dataset = train_dataset
        self.val_dataset = val_dataset
        self.callbacks = callbacks if callbacks is not None else []
        self.current_step = 0
        self.current_epoch = 0
        self.total_steps = len(train_dataset)  # Simplified
        self.rng_key = jax.random.PRNGKey(0)
        # Callbacks registration
        for cb in self.callbacks:
            cb.trainer = self  # Allow callbacks to interact with the trainer if needed

    # jax.jit can only trace JAX types, so `self` must be marked static
    # rather than decorating the method with bare @jax.jit.
    @functools.partial(jax.jit, static_argnums=0)
    def train_step(self, params, opt_state, batch, rng_key):
        input_ids = batch["input_ids"]
        labels = batch["labels"]

        def compute_loss(params):
            logits = self.model_module.apply({'params': params}, input_ids, train=True, rngs={'dropout': rng_key})
            # Pass params to custom_llm_loss for the regularization term
            loss = self.loss_fn(logits, labels, params)
            return loss, logits  # Return logits if needed for other metrics

        # Compute the loss and its gradient with respect to the parameters
        (loss_value, logits), grads = jax.value_and_grad(compute_loss, has_aux=True)(params)
        # Apply gradients
        updates, opt_state = self.optimizer.update(grads, opt_state, params)
        params = optax.apply_updates(params, updates)
        return params, opt_state, loss_value, logits

    def train(self, num_epochs: int):
        trainer_state = TrainerState(
            current_step=0,
            current_epoch=0,
            total_steps=self.total_steps * num_epochs,
            params=self.params,
            opt_state=self.opt_state,
            rng_key=self.rng_key,
            loss_value=0.0  # Will be updated
        )
        # Call on_train_begin for all callbacks
        for cb in self.callbacks:
            cb.on_train_begin(trainer_state)

        for epoch in range(num_epochs):
            self.current_epoch = epoch
            print(f"\n--- Epoch {epoch + 1}/{num_epochs} ---")
            epoch_losses = []
            for i, batch in enumerate(self.train_dataset):
                trainer_state.current_step = self.current_step
                trainer_state.current_epoch = self.current_epoch
                # Split the RNG key for dropout and other random operations
                step_key, dropout_key = jax.random.split(trainer_state.rng_key)
                trainer_state.rng_key = step_key  # Update the trainer state's rng_key
                self.params, self.opt_state, loss_value, logits = self.train_step(
                    self.params, self.opt_state, batch, dropout_key
                )
                trainer_state.params = self.params
                trainer_state.opt_state = self.opt_state
                trainer_state.loss_value = loss_value
                epoch_losses.append(loss_value)
                # Call on_step_end for all callbacks
                for cb in self.callbacks:
                    cb.on_step_end(trainer_state, logs={"loss": loss_value.item()})  # .item() yields a Python float
                self.current_step += 1

            # Simulate validation loss computation for the callback
            val_loss = jnp.mean(jnp.array(epoch_losses)).item() * 0.9  # Demo only; normally computed on val_dataset
            logs = {"loss": jnp.mean(jnp.array(epoch_losses)).item(), "val_loss": val_loss}
            # Call on_epoch_end for all callbacks
            for cb in self.callbacks:
                cb.on_epoch_end(trainer_state, logs=logs)

        # Call on_train_end for all callbacks
        for cb in self.callbacks:
            cb.on_train_end(trainer_state)

# Instantiate the dummy dataset
train_data = DummyDataset(num_batches=50, batch_size=4, seq_len=64, vocab_size=1000)

# Instantiate the simplified Tunix Trainer
simplified_trainer = SimplifiedTunixTrainer(
    model_module=model,
    params=params,
    optimizer=custom_optim,
    loss_fn=custom_llm_loss,  # Our custom loss function
    train_dataset=train_data,
    callbacks=[my_callback]  # Our custom callback
)

# Run the training
# simplified_trainer.train(num_epochs=2)

print("\n--- Custom Tunix Training Setup Complete ---")
print("To run the demo, uncomment `simplified_trainer.train(num_epochs=2)` above.")
print("Observe how the custom loss, optimizer schedule, and callback logging/checkpointing interact.")
print("The output will show step-wise average loss and epoch-end validation metric monitoring.")
Explanation of Integration:
- We’ve defined MyTunixModel and DummyDataset as placeholders to simulate a real Tunix environment.
- A SimplifiedTunixTrainer class is created to show how model_module, params, optimizer, loss_fn, and callbacks would be passed in.
- The train_step method uses jax.value_and_grad to compute the loss and gradients, then optimizer.update and optax.apply_updates to update parameters, demonstrating the JAX/Optax integration.
- The train method orchestrates the training loop, calling the appropriate callback methods (on_train_begin, on_step_end, on_epoch_end, on_train_end) at their respective points.
- Crucially, our custom_llm_loss is passed directly as loss_fn, and custom_optim is passed as optimizer. Our my_callback is added to the callbacks list.
- Running simplified_trainer.train(num_epochs=2) would execute this custom training flow.
Mini-Challenge: Enhancing the Callback
Your turn! Let’s enhance our custom callback.
Challenge: Modify the CustomLoggerAndCheckpointCallback to also include a simple early stopping mechanism. If the monitor_metric (e.g., val_loss) does not improve for a specified number of consecutive epochs (called patience), the callback should signal the trainer to stop training.
Hint:
- Add patience: int and patience_counter: int = 0 to the callback’s __init__.
- In on_epoch_end, when the metric doesn’t improve, increment patience_counter. If it does improve, reset patience_counter to 0.
- If patience_counter exceeds patience, you’ll need a way to stop the training. In a real Tunix Trainer, there might be a trainer.stop_training = True flag or a similar mechanism in the TrainerState. For our SimplifiedTunixTrainer, you could raise a custom exception (StopTrainingException) that the train loop catches.
What to Observe/Learn:
- How callbacks can actively control the training flow, not just observe it.
- The interplay between different callback functionalities (logging, checkpointing, early stopping).
- The importance of TrainerState for callbacks to access and potentially modify global training state.
Common Pitfalls & Troubleshooting
Customization, while powerful, can introduce new challenges. Here are a few common pitfalls:
Shape Mismatches in Custom Loss Functions:
- Pitfall: Your custom loss function expects logits or labels of a certain shape, but the model output or dataset provides something different. This often surfaces as a shape-related TypeError or ValueError from JAX.
- Troubleshooting: Print the shapes of logits and labels right at the start of your custom_llm_loss function: print(f"Logits shape: {logits.shape}, Labels shape: {labels.shape}"). Ensure they are compatible with your loss calculations (e.g., (batch_size, seq_len, vocab_size) for logits and (batch_size, seq_len) for labels for token classification).
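One cheap defence is to assert the expected shapes at the top of the loss function, so you get a readable message instead of a deep JAX trace. A sketch (the helper name is our own, not a Tunix utility):

```python
import jax.numpy as jnp

def check_loss_inputs(logits: jnp.ndarray, labels: jnp.ndarray) -> None:
    """Fail fast with a readable message instead of a deep JAX shape error."""
    assert logits.ndim == 3, f"expected (batch, seq, vocab) logits, got {logits.shape}"
    assert labels.ndim == 2, f"expected (batch, seq) labels, got {labels.shape}"
    assert logits.shape[:2] == labels.shape, (
        f"batch/seq mismatch: {logits.shape[:2]} vs {labels.shape}"
    )

# Compatible shapes pass silently; a mismatch raises immediately.
check_loss_inputs(jnp.zeros((2, 8, 100)), jnp.zeros((2, 8), dtype=jnp.int32))
```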
Non-Differentiable Operations in Loss:
- Pitfall: You accidentally include an operation in your custom loss that JAX cannot differentiate (e.g., converting a JAX array to a Python float in the middle of a calculation, or using certain non-JAX NumPy functions).
- Troubleshooting: JAX will usually raise a clear error message like “Abstract tracer value encountered at …” or “Gradient of … is not defined.” Review your loss function for any non-JAX operations or explicit conversions. Stick to jax.numpy functions.
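The failure mode is easy to reproduce: the moment a Python-level conversion like float() touches a traced value, JAX raises a concretization error, while the pure jax.numpy version differentiates fine. A small demonstration:

```python
import jax
import jax.numpy as jnp

def good_loss(x):
    # Stays inside jax.numpy: fully traceable and differentiable.
    return jnp.sum(jnp.abs(x))

def bad_loss(x):
    # float() demands a concrete value; under jax.grad, x is an abstract
    # tracer, so JAX raises a concretization error here.
    return float(jnp.sum(x)) * 2.0

x = jnp.array([1.0, -2.0])
grad_ok = jax.grad(good_loss)(x)  # gradient of sum(|x|) is sign(x)

try:
    jax.grad(bad_loss)(x)
    bad_loss_failed = False
except Exception:  # JAX raises a ConcretizationTypeError subclass
    bad_loss_failed = True
```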
Incorrect Optimizer Initialization or Schedule Logic:
- Pitfall: Your learning rate schedule might not be applied correctly, or the optimizer’s internal state (opt_state) isn’t managed properly, leading to NaN losses or stagnant training.
- Troubleshooting:
  - Plot your learning rate schedule: Create a dummy loop over total_steps and print lr_schedule(step) to visualize its progression.
  - Check optax.chain order: The order of transformations matters (e.g., gradient clipping usually comes before applying the main optimizer).
  - Ensure optimizer.init(params) is called once at the start and optimizer.update is called correctly in each step.
Callback Side Effects and State Management:
- Pitfall: Your callback modifies TrainerState in an unexpected way, or it relies on state that isn’t guaranteed to be present or updated at its execution point. For instance, trying to access val_loss in on_step_end when it’s only computed in on_epoch_end.
- Troubleshooting:
  - Be explicit about which TrainerState attributes your callback uses.
  - Print the logs dictionary passed to on_epoch_end to see what metrics are actually available.
  - Minimize side effects. If a callback needs to modify shared state, ensure it’s done safely and predictably.
Summary
Phew! You’ve just taken a massive leap in your Tunix journey. In this chapter, we’ve explored the critical avenues for customizing your LLM post-training pipeline:
- Loss Functions: You learned how to define custom, differentiable loss functions in JAX, combining standard objectives like cross-entropy with custom regularization or task-specific terms to precisely guide your model’s learning.
- Optimizers: We delved into Optax, JAX’s powerful optimizer library, demonstrating how to construct sophisticated optimizers with custom learning rate schedules (like warm-up and cosine decay) and gradient transformations (like global norm clipping).
- Callbacks: You mastered the art of creating custom callbacks to inject logic at various points in the training lifecycle, enabling features like periodic logging, conditional checkpointing, and even early stopping.
By understanding and applying these customization techniques, you’re no longer limited to off-the-shelf solutions. You can now design and implement highly specialized post-training routines tailored to the unique demands of your LLMs and research objectives.
What’s Next?
In the next chapter, we’ll build upon this foundation by exploring advanced model architectures and integration with external JAX/Flax components. Get ready to see how you can bring even more complex and custom models into the Tunix ecosystem!
References
- Tunix GitHub Repository
- Tunix Documentation (Read the Docs)
- JAX Official Documentation
- Optax GitHub Repository
- Flax Official Documentation
- Google AI Blog: Introducing Tunix