How we accelerated fine-tuning by 15x in less than 15 days

July 2, 2024 · 6 min read
Arnav Garg

The 2023 GPU shortage, fueled by COVID-19 supply chain disruptions, crypto-mining surges, and the GenAI explosion, led to an industry-wide effort to develop resource-efficient training methods. Techniques like Low-Rank Adaptation (LoRA) gained massive popularity as teams sought to do more with less; in fact, LoRA is one of the primary techniques we use for fine-tuning.

GPU shortage drove innovations in model training efficiency

As GPU capacity has increased over the last year, there's now the opportunity to prioritize training speed over efficiency. The benefits of prioritizing speed are two-fold:

  1. Faster training means faster iteration cycles and, ultimately, less time spent developing high-performance models.
  2. Faster training can result in a significant reduction in GPU costs.

To help our customers accelerate training, we applied dozens of optimizations to our LLM fine-tuning stack, striking a careful balance between training speed and efficient use of resources. The result: a 15x increase in training speed while maintaining a cost-efficient approach (as an example, we fine-tuned hundreds of high-performing LLMs for $8 on average; see our Fine-tuning Index). We recently discussed these fine-tuning optimizations in the webinar, How we accelerated fine-tuning by 15x in just 15 days. In this blog, we'll walk through the nine critical steps we took to achieve such a dramatic speedup, including key strategies, metrics, and insights.

Nine Steps to Accelerate Your Fine-tuning Jobs

Fine-tuning is one of the most cost-effective methods to generate lightweight, task-specific language models tailored to your data. Two key techniques are Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA). LoRA is a Parameter-Efficient Fine-Tuning (PEFT) method that updates only a small fraction of the model's weights, which also helps avoid catastrophic forgetting. QLoRA further enhances performance by adding a quantization step. These approaches enable rapid adaptation of large pre-trained models to specific domains or tasks while significantly reducing computational and memory requirements. Read our blog on LoRA Adapters to learn more about these techniques.
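To make this concrete, here is a minimal sketch of a QLoRA setup using Hugging Face transformers, peft, and bitsandbytes. The base model name and hyperparameters are illustrative placeholders, not the exact values used in our stack.

```python
# Minimal QLoRA sketch: 4-bit base weights + trainable low-rank adapters.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize base weights to 4-bit (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=16,                                   # low-rank dimension (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)  # only the small adapter weights are trainable
model.print_trainable_parameters()
```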

Before applying any optimization beyond QLoRA, fine-tuning training times varied significantly depending on the dataset. Smaller datasets and those with fewer tokens per example took several hours, while more challenging datasets like Magicoder and Drop required multiple days of training.

Step 1: Upgrade Hardware From A10Gs to A100s

The A100 GPUs offer significant performance improvements over the previous generation A10Gs. They have 50% more tensor cores and 2.5x the raw compute power in terms of Floating Point 16 (FP16) or Brain Floating Point 16 (BF16) operations per second. By upgrading from A10Gs to A100s, we achieved an average speed improvement of about 60% across our benchmarks. 

However, the magnitude of the performance boost varied depending on the dataset size and complexity. Smaller and simpler datasets like Viggo and Glue saw more modest gains compared to larger, more computationally demanding workloads, as illustrated in the bar graph below. To streamline our analysis, we'll focus on results from four representative datasets that capture a range of scales in terms of number of rows and tokens per row. These selected datasets highlight how the A100's cutting-edge architecture and optimizations yield the greatest benefits when tackling substantial, complex datasets.

Training speed improvement by simply upgrading hardware from A10 to A100

Step 2: Moving from Paged Optimizers to Fused Optimizers

The original QLoRA paper recommends using the Paged Adam optimizer, which leverages NVIDIA's Unified Memory architecture to offload optimizer states from GPU to CPU memory, preventing GPU out-of-memory (OOM) errors. However, this paging mechanism introduces significant memory access overhead, particularly when training with larger batch sizes and long context lengths, as the optimizer state must be repeatedly transferred between CPU and GPU.

Compared to the Paged Adam optimizer, the standard Adam optimizer delivered a 10% speedup, while the Fused AdamW optimizer achieved a remarkable 39% average performance gain.

Regular Adam optimizer improves speeds by 10% and Fused AdamW by 39% on average.

Fused AdamW combines gradient computation, moving average updates, and corrected moments computation into a single step through a fused kernel, eliminating redundant operations and memory accesses. Moreover, fused optimizers batch parameter updates, enabling efficient parallelization and better utilization of computational resources. This results in faster weight updates and accelerated training.
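Swapping optimizers is a small code change. The sketch below shows the general pattern with PyTorch's fused AdamW; the learning rate is illustrative, and `model` is assumed to be the PEFT-wrapped model from the earlier snippet.

```python
import torch

# Before: a paged optimizer from bitsandbytes that offloads state to CPU memory
# import bitsandbytes as bnb
# optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=2e-4)

# After: fused AdamW, which applies the parameter update in a single fused CUDA kernel
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    fused=True,   # requires CUDA tensors; available in recent PyTorch releases
)
```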

Step 3: Automatic Batch Size Tuning 

While larger batch sizes can significantly accelerate training by leveraging the GPU's parallel processing capabilities, they also increase the risk of OOM errors, especially when dealing with longer input sequences. On the other hand, setting the batch size to 1 eliminates OOM risks, but severely underutilizes the GPU's potential. 

To strike an optimal balance, we implemented a dynamic batch size tuning mechanism that automatically creates the largest possible batches within the GPU's memory constraints. By continuously monitoring the GPU's memory usage and adjusting the batch size accordingly, our system maximizes GPU utilization while avoiding OOM errors.

This dynamic tuning approach resulted in a remarkable 30-40% reduction in average training time. The performance boost comes from the increased number of samples processed per training step, as the system intelligently packs the maximum amount of data into each batch based on the available GPU memory.
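The core pattern is simple even though the production version is more involved. Below is a simplified sketch of the idea, not our exact implementation; `run_training_step` is a hypothetical callable that executes one forward/backward pass at a given batch size.

```python
import torch

def find_max_batch_size(run_training_step, start=1, ceiling=256):
    """Double the candidate batch size until an OOM is hit, then return the
    largest batch size that trained successfully."""
    best = start
    candidate = start
    while candidate <= ceiling:
        try:
            run_training_step(batch_size=candidate)  # probe one training step
            best = candidate
            candidate *= 2
        except torch.cuda.OutOfMemoryError:
            torch.cuda.empty_cache()                 # release fragments before backing off
            break
    return best
```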

Automatic batch size tuning improved GPU utilization and reduced the cost of frequent parameter updates.

Step 4: Disabling Expensive Text Metrics

The Ludwig framework computes numerous text-based metrics by default at every checkpoint on both the validation and test data. However, many of these metrics – such as BLEU or ROUGE – rely on computationally expensive n-gram calculations with quadratic complexity. Moreover, they can be difficult to interpret during the training process and may not be relevant for certain tasks.

By computing only the most meaningful and inexpensive metrics like Loss, we significantly reduced the computational overhead of metric evaluation. Depending on the context length of the input data, we were able to save between 1 and 17 minutes per checkpoint. This optimization dramatically improved training efficiency without compromising our ability to monitor and assess model performance.
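To illustrate the trade-off, here is a hedged sketch of an evaluation loop that computes only loss at each checkpoint instead of generation-based n-gram metrics. The `eval_loop` function and its arguments are hypothetical placeholders (not Ludwig's actual API), and `outputs.loss` assumes a Hugging Face-style model output.

```python
import torch

@torch.no_grad()
def eval_loop(model, dataloader, compute_text_metrics=False):
    total_loss, n_batches = 0.0, 0
    for batch in dataloader:
        outputs = model(**batch)
        total_loss += outputs.loss.item()        # cheap: one scalar per batch
        n_batches += 1
        if compute_text_metrics:
            # Decoding generations and scoring BLEU/ROUGE here is what makes
            # checkpoints expensive; we skip it during training.
            pass
    return {"loss": total_loss / max(n_batches, 1)}
```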

Disable computing 10+ evaluation metrics on every checkpoint (1000 steps) on each set (validation, test).

Step 5: Disabling Gradient Checkpointing 

Gradient checkpointing is a memory-saving technique that strategically stores a subset of activations during the forward pass, recomputing the rest on-the-fly during backpropagation. While this reduces memory usage, it introduces an extra forward pass, increasing computation time. Ultimately, we chose to disable gradient checkpointing for two reasons:

  1. Our previous optimizations had sufficiently reduced memory constraints so we no longer needed it.
  2. We adopted A100s, which allow longer sequence lengths (without OOMing) compared to A10Gs. 

This change alone accelerated training by an impressive 25-35%.
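As a minimal sketch, assuming a Hugging Face transformers model, gradient checkpointing can be toggled directly on the model object once memory headroom allows; whether disabling it is safe depends on your sequence lengths and batch sizes.

```python
# Keep all activations in memory and skip the extra recompute pass during backprop.
model.gradient_checkpointing_disable()
```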

The bar graph below illustrates the cumulative impact of our optimizations up to this point, highlighting the substantial performance gains achieved by thoughtfully tuning the training process.

Summary of improvements so far: 3.71x faster on average, 5x faster for longer datasets with more rows.

Step 6: Loading Pre-quantized Weights

As mentioned above, Quantized LoRA (QLoRA) is a highly efficient method for fine-tuning LLMs. The quantization process typically involves loading weights partition by partition into memory, computing quantization coefficients, multiplying them with the weights, and casting the results to uint8 before placing them on the GPU. For example, quantizing the 8 billion parameter Llama 3 model on a T4 GPU takes about 73 seconds. 

However, directly loading pre-quantized weights and quantization coefficients can significantly reduce memory requirements and loading time. By employing this technique, only a single partition of weights (6 GB of memory) needs to be loaded onto the GPU. This optimization reduces the loading time by at least 4x, with even greater benefits for larger models where on-the-fly quantization is computationally expensive, such as Mixtral 8x7B and Llama-3-70B, which have 19 and 30 shards, respectively.
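A hedged sketch of the idea: quantize and persist the weights once, then load the pre-quantized shards directly instead of re-quantizing on every job. The checkpoint path is a placeholder, and serializing 4-bit weights this way requires recent versions of transformers and bitsandbytes.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# One-time: quantize the base model and persist the 4-bit weights plus quantization state.
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B",
                                             quantization_config=bnb_config)
model.save_pretrained("llama-3-8b-nf4")   # placeholder path for the pre-quantized checkpoint

# Every subsequent job: load the already-quantized shards straight onto the GPU,
# skipping the per-partition quantization pass described above.
model = AutoModelForCausalLM.from_pretrained("llama-3-8b-nf4", device_map="auto")
```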

Load quantized weights onto the GPU directly instead of quantizing weights at load time => 3x-4x faster load time.

Step 7: Sample Packing

To further optimize GPU utilization, we built upon the dynamic batch size tuning from step 3 and implemented sample packing. Even with optimal batch sizes, we found that the number of tokens per sample could still be relatively small, leaving room for additional efficiency gains. Sample packing addresses this by concatenating multiple samples together, using special Beginning Of Sequence (BOS) and End Of Sequence (EOS) tokens to separate them. This technique allows us to pack more tokens into each batch, effectively compressing the data and enabling the GPU to process more information in parallel. By carefully tuning the sample packing parameters, we achieved a remarkable 2x to 5x speedup in training time.
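Here is a simplified sketch of sample packing (not our exact implementation): tokenized examples are concatenated into one stream, separated by BOS/EOS tokens, and sliced into fixed-length blocks so each batch is densely filled with real tokens.

```python
def pack_samples(tokenized_samples, block_size, bos_id, eos_id):
    """tokenized_samples: iterable of lists of token ids."""
    stream = []
    for ids in tokenized_samples:
        stream.extend([bos_id] + ids + [eos_id])   # delimit each sample
    # Slice the continuous stream into equal-length packed sequences.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]
```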

Concatenation of three tokens to pack samples further.

Step 8: Upgrading from PyTorch's Scaled Dot Product (SDP) Attention Kernel to Flash Attention 2

Attention computation involves multiplying the query and key vectors, masking, applying softmax, (optionally) applying dropout, and finally multiplying the result with the value tensor. While matrix multiplications are highly efficient on GPUs, the softmax, dropout, and masking operations consume the most time during attention computation because they are memory-bound.

FlashAttention is an I/O-aware algorithm that optimizes attention computation by accounting for the GPU architecture. A GPU consists of a small but very fast Static Random Access Memory (SRAM, ~20MB) and a much larger but slower High Bandwidth Memory (HBM, ~40GB). FlashAttention fuses the three operations into a single kernel and splits the matrices into small blocks. These blocks are loaded into SRAM, where attention is computed "locally", then rescaled and copied back to HBM. The computations are parallelized over the batch size and the number of attention heads.

Fused kernel to compute Flash attention.

FlashAttention-2 further optimizes the process. The scaling is postponed until the very end to reduce the amount of data being copied between SRAM and HBM for each block, which further reduces the bottlenecks from memory read/writes. Parallelization is extended to the sequence length dimension, fully benefiting from the packing done in improvements #3 and #7. The order of loops is also swapped to make the most of this parallelism. Finally, the work is partitioned differently across warps (32 threads) to minimize communication between them.

These optimizations accelerated attention computation by 3x to 8x, depending on the dataset.
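As a minimal sketch, assuming a Hugging Face transformers model and the flash-attn package installed, FlashAttention-2 can be selected at load time instead of the default scaled-dot-product attention kernel. The model name is an example.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,               # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # needs the flash-attn package; default is "sdpa"
)
```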

Step 9: Optimized CUDA Kernels for LoRA Fine Tuning

To further enhance performance, we used optimized CUDA kernels for various stages of the LoRA fine-tuning process, including quantization, dequantization, cross-entropy loss computation, and efficient forward and backward multiplications for LoRA layers, particularly those focused on Query, Key, Value (QKV) computations. By leveraging these specialized kernels, we achieved an additional 20% speedup on hardware with native BF16 support such as NVIDIA Ampere GPUs. This optimization builds upon the cumulative improvements from previous steps, further reducing training time and computational overhead.
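For reference, the computation these kernels accelerate is the LoRA-layer forward pass sketched below; the fused versions avoid materializing intermediates and combine dequantization with the matmuls. This plain-PyTorch version is purely illustrative, and `dequantize` is a hypothetical stand-in for the NF4 dequantization kernel.

```python
import torch

def lora_linear(x, W_q, dequantize, A, B, scaling):
    """x: (batch, seq, in_dim); W_q: quantized base weight; A: (r, in_dim); B: (out_dim, r)."""
    W = dequantize(W_q)                       # NF4 -> bf16 dequantization (its own fused kernel)
    base = x @ W.t()                          # frozen base projection (e.g., Q, K, or V)
    update = (x @ A.t()) @ B.t() * scaling    # low-rank adapter path
    return base + update
```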

Impact of our optimizations on training times across 4 popular datasets: average acceleration of 16x.

Next Steps: Start Accelerating Your LLM Fine-tuning Jobs

With these optimizations, we’re proud to offer the fastest, most efficient platform for fine-tuning and serving LLMs. Now you know firsthand how our talented team of engineers improved training speeds by over 15x in just about two weeks. As an example, we reduced training time on the Magicoder dataset from almost 3 days to just 6.5 hours, and this is just the beginning.

Sign up for Predibase today and take advantage of our $25 free credit to benefit from these optimizations firsthand. 

Predibase offers the fastest, most efficient LLM fine-tuning and serving platform.
