The Fine-tuning Index
Performance benchmarks from fine-tuning 700+ open-source LLMs
We've launched a new Fine-Tuning Leaderboard with the latest models!
This page accompanies our April 2024 arXiv paper and will not be updated with new models.
Developers are no longer constrained to costly commercial models like GPT-4. The rise of high-performing open-source LLMs—like Zephyr, Mistral, and Llama—offers an attractive alternative that enables organizations to reduce costs and own model IP. However, open-source models can fall short of commercial offerings. That’s where fine-tuning comes in. Smaller open-source models can deliver GPT-like performance through the power of fine-tuning.
To illustrate this point and help enterprise AI teams select the best open-source model for their application, we conducted over 700 LLM fine-tuning experiments. The results of our experiments are available in our arXiv research paper and shared below through a series of interactive charts. Enjoy!
Open-source Model Fine-Tuning Leaderboard
The Fine-tuning Leaderboard shows the performance of each model aggregated across 31 distinct tasks. You can evaluate performance before and after fine-tuning by selecting the base or fine-tuned model button at the top. Remarkably, most of the fine-tuned open-source models surpass GPT-4, with Llama-3, Phi-3, and Zephyr demonstrating the strongest performance.
About the Fine-Tuning Index
Why
Developers are no longer constrained to costly commercial models like GPT-4. High-performing SLMs offer an attractive alternative, enabling organizations to reduce costs and own model IP. Through fine-tuning, smaller open-source models can deliver GPT-like performance. This index helps enterprise AI teams select the best open-source model for their specific applications and needs.
What
The Fine-Tuning Index is a comprehensive resource showcasing the performance of the latest Small Language Models (SLMs) like Llama 3.1, Solar 10.7B, and Mistral. It compares these models against each other and against OpenAI baselines across 31 diverse tasks.
How
We conducted over 700 LLM fine-tuning experiments in April 2024 and documented the results in our arXiv research paper. Each SLM was evaluated across 31 diverse tasks, and the results are published in a comparative table, providing valuable insights into the evolving landscape of language models.
Key Takeaways
LoRA Fine-tuned models outperform GPT-4 on specialized tasks
Fine-tuned models serve as experts within their specific domains, surpassing the performance of GPT-4 on 85% of the tasks we tested. On average, fine-tuning yielded an improvement of 25-50%.
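The mechanics behind these gains are simple: LoRA freezes the base weights W and trains only a small low-rank update, so the adapted layer computes y = Wx + (α/r)·BAx. Below is a minimal pure-Python sketch of that forward pass; the dimensions and values are illustrative, not taken from our benchmarks.

```python
# Minimal LoRA forward pass: the adapted layer computes
#   y = W @ x + (alpha / r) * B @ A @ x
# The base weight W stays frozen; only the small matrices A (r x d) and
# B (d x r) are trained. Dimensions here are illustrative.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)                      # frozen base projection
    delta = matvec(B, matvec(A, x))          # rank-r update B @ (A @ x)
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

d, r = 4, 2
W = [[float(i == j) for j in range(d)] for i in range(d)]  # identity base
A = [[0.5] * d for _ in range(r)]
B = [[0.0] * r for _ in range(d)]            # standard zero-init for B
x = [1.0, 2.0, 3.0, 4.0]

# With B initialized to zero, the adapted model matches the base exactly,
# so fine-tuning starts from the pretrained behavior.
assert lora_forward(W, A, B, x) == matvec(W, x)

# Why training is cheap: at d=4096, r=8 a LoRA adapter trains 2*d*r
# parameters per weight matrix instead of d*d.
lora_params, full_params = 2 * 4096 * 8, 4096 * 4096   # 65,536 vs 16,777,216
```

Because only A and B are trained, an adapter is a tiny fraction of the base model's size, which is what keeps per-adapter training costs in the single-digit-dollar range.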
Average model quality across 31 distinct tasks
The fine-tuned models outperformed GPT-4 by a wide margin on nearly all the tasks, as shown below. The detailed analysis section provides a deeper dive into individual model performance by task.
Difference in Task Performance: Best Fine-tuned Adapter vs. GPT-4
LoRA Fine-tuned models are fast and cheap to train and serve
As part of our benchmarks, we launched LoRA Land, an interactive web app that allows users to query 25+ fine-tuned Mistral adapters that outperform GPT-4. The LoRA Land models were fine-tuned for $8 each on Predibase and served on a single GPU with open-source LoRAX.
LoRA Land enabled us to benchmark the performance and cost of serving LoRA fine-tuned models in a production environment. Not only is serving fine-tuned LLMs on Predibase significantly cheaper than GPT-4, but we also delivered faster speeds with little-to-no impact on request latency when scaling to many models and users, as shown in the table below. You can read more about our analysis in the arXiv paper. GPT-4 benchmarks were pulled from Artificial Analysis.
Time to First Token (TTFT): Predibase vs. OpenAI
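TTFT is simply the delay between sending a request and receiving the first streamed token. A small, self-contained sketch of how such a measurement can be taken (the simulated stream below stands in for a real model endpoint):

```python
import time

def time_to_first_token(token_stream):
    """Return (ttft_seconds, tokens) for an iterable of streamed tokens.

    TTFT is the delay between issuing the request and receiving the
    first token -- approximated here as time until the stream's first item.
    """
    start = time.monotonic()
    tokens, ttft = [], None
    for tok in token_stream:
        if ttft is None:
            ttft = time.monotonic() - start
        tokens.append(tok)
    return ttft, tokens

# Simulated stream: the "model" pauses 50 ms, then emits tokens quickly.
def fake_stream():
    time.sleep(0.05)
    yield from ["Fine", "-tuned", " models", " are", " fast"]

ttft, toks = time_to_first_token(fake_stream())
```

In a real benchmark the iterator would wrap a streaming HTTP response, and you would aggregate TTFT over many requests and concurrency levels rather than a single call.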
As mentioned above, we also compared the cost of GPT-4 to serving your own fine-tuned adapters on dedicated deployments with open-source LoRAX and Predibase. LoRAX enables you to cost-effectively serve many adapters on a single GPU, making it possible to build your own GPT-4 with a series of fine-tuned task-specific models.
Total Inference Costs: GPT-4 vs. Predibase Dedicated Deployment (0.25 A100)
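With LoRAX, switching between task-specific adapters is just a per-request parameter on a shared deployment. The sketch below builds a request in the shape of the LoRAX `/generate` API; the endpoint URL and adapter name are placeholders, not real deployments.

```python
import json

# Sketch of querying a LoRAX deployment, which routes each request to a
# task-specific adapter on top of one shared base model. The adapter name
# and URL below are hypothetical placeholders.

def build_generate_request(prompt, adapter_id=None, max_new_tokens=64):
    params = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        params["adapter_id"] = adapter_id   # omit to hit the base model
    return {"inputs": prompt, "parameters": params}

body = build_generate_request(
    "Classify the sentiment: 'The product is great!'",
    adapter_id="my-org/sentiment-adapter",   # hypothetical adapter
)
payload = json.dumps(body)

# To send it (requires a running LoRAX server; not executed here):
# urllib.request.urlopen(
#     urllib.request.Request(
#         "http://localhost:8080/generate", data=payload.encode(),
#         headers={"Content-Type": "application/json"}))
```

Because `adapter_id` changes per request, dozens of fine-tuned "experts" can share one GPU instead of each requiring a dedicated deployment.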
Choosing the Right Tasks for Fine-tuning
Fine-tuning Task Leaderboard
The task leaderboard shows the performance of each fine-tuned and GPT model for the selected task. Use this chart to determine which model performs best for a specific task. Remarkably, most of the fine-tuned open-source models surpass GPT-4, with the Llama-3 series most frequently demonstrating the strongest performance.
Specialized tasks are ideal for LoRA fine-tuning
Our analysis showed that it’s more difficult to beat GPT-4 on broad tasks. For example, the following poor-performing tasks were very wide in scope: generating code for a large variety of programming languages like Python, Java, and R; answering multiple-choice questions spanning dozens of topics.
Conversely, we experienced great success when fine-tuning open-source LLMs for narrowly focused or domain-specific tasks—such as legal contract review, medical text classification, and domain-specific code generation like a company’s internal codebase. As such, it requires less effort to outperform GPT-4 when fine-tuning for a specific task vs. a generalized use case.
Difference between GPT-4 and Highest Performing Fine-tuned Model by Category
Task categories are ordered from harder to easier to beat GPT-4.
Choosing the Best Base Model for Fine-tuning
When selecting your base model, it's important to consider performance both at the task level and in aggregate across many tasks. This is especially important when using open-source frameworks like LoRAX that allow you to serve multiple fine-tuned adapters on a single base model: getting the most out of one base model is ideal.
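These two views can be computed from a simple model-by-task score matrix: average score for the aggregate view, and how often a model tops an individual task for the task-level view. The scores below are made-up illustrative numbers, not results from our benchmark.

```python
# Ranking base models per task and in aggregate, the two views described
# above. Scores are illustrative, not benchmark results.

scores = {                     # model -> {task: score}
    "llama-3-8b": {"ner": 0.91, "sql": 0.84, "summarize": 0.78},
    "mistral-7b": {"ner": 0.88, "sql": 0.86, "summarize": 0.74},
    "phi-3":      {"ner": 0.85, "sql": 0.80, "summarize": 0.81},
}

def average_score(model):
    vals = scores[model].values()
    return sum(vals) / len(vals)

def top_count(model):
    """How often this model is the best choice for an individual task."""
    return sum(
        1 for task in scores[model]
        if all(scores[m][task] <= scores[model][task] for m in scores)
    )

best_overall = max(scores, key=average_score)     # aggregate view
tops = {m: top_count(m) for m in scores}          # per-task view
```

Note how the two views can disagree: a model can win the most individual tasks without having the best average, which is why both charts below are worth reading together.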
The Llama models lead the pack
In terms of architecture, our findings indicate that the Llama series leads the pack on average, with Phi, Zephyr, and Mistral following closely behind. Interestingly, Mistral is built partially on the Llama architecture, and Zephyr is a derivative of the original Mistral Instruct model. This suggests that the Llama and Mistral model series are well suited to adapting to smaller, task-specific jobs.
Top 5 Fine-tuned Models by Average Score
Furthermore, when looking at the frequency of top performance, the Llama-3 series stands out, reinforcing its position as the leading architecture for fine-tuning.
Most Frequent Top-performing Fine-tuned Models
Choosing Your Data for Fine-tuning
Specialized datasets yield the best results
During our evaluations, we explored the impact of data complexity on fine-tuning results and discovered a few moderate correlations. Looking at lift over the base model, we found a positive correlation with input length. Conversely, looking at lift over GPT-4, we found moderate negative correlations with output length. This suggests that narrower, easier tasks are more likely to see success with fine-tuned adapters.
| | Base model lift | GPT-4 lift |
|---|---|---|
| Base model score | 0.113 | -0.164 |
| Input length μ | 0.351 | -0.057 |
| Input length σ | 0.300 | -0.104 |
| Output length μ | -0.188 | -0.462 |
| Output length σ | -0.199 | -0.469 |
| Example length μ | 0.279 | -0.188 |
| Example length σ | 0.289 | -0.122 |
| # examples | 0.275 | -0.134 |
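The figures above are Pearson correlation coefficients between each dataset statistic and the observed lift. A pure-Python sketch of that statistic, run on an illustrative (made-up) pairing of output length vs. GPT-4 lift:

```python
from math import sqrt

# Pearson correlation, the statistic behind the table above.
# The sample data is illustrative, not the benchmark's raw measurements.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy pairing: mean output length per task vs. lift over GPT-4.
output_len = [20, 45, 80, 150, 300]
gpt4_lift  = [0.40, 0.25, 0.10, -0.05, -0.20]

r = pearson(output_len, gpt4_lift)   # strongly negative, like the table
```

A strongly negative r here mirrors the table's pattern: tasks with longer expected outputs tend to show less lift over GPT-4.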
Small data is all you need: <10,000 examples
Unlike initial model training, fine-tuning can be effective with a small number of examples. Although some of our datasets were extensive, the majority contained fewer than 20,000 examples. In fact, over 40% contained fewer than 10,000 data points.
Distribution of Training Set Sizes
Learn More
We hope you enjoyed the Fine-tuning Index and find it helpful as you train your own task-specific LLMs.
All of the models were fine-tuned for less than $8 each and served in production using Predibase and LoRAX. Below are some resources to help you fine-tune models that outperform GPT-4.
Read the arXiv Paper
Deep dive into all of the technical details and results from our 700+ fine-tuning experiments with our research paper.
Watch the Webinar
Watch this on-demand technical session to learn how we fine-tuned LLMs to rival GPT-4 and the lessons we learned along the way.
Free Fine-tuning Credits
Sign up for $25 in free credits on Predibase to fine-tune and serve your own models in production.