The Fine-tuning Index

Performance benchmarks from 700+ open-source LLM fine-tuning experiments

Developers are no longer constrained to costly commercial models like GPT-4. The rise of high-performing open-source LLMs—like Zephyr, Mistral, and Llama—offers an attractive alternative that enables organizations to reduce costs and own model IP. However, open-source models can fall short of commercial offerings. That’s where fine-tuning comes in. Smaller open-source models can deliver GPT-like performance through the power of fine-tuning.

To illustrate this point and help enterprise AI teams select the best open-source model for their application, we conducted over 700 LLM fine-tuning experiments. The results are available in our arXiv research paper and shared below through a series of interactive charts. Enjoy!

Open-source Model Fine-Tuning Leaderboard

The Fine-tuning Leaderboard shows the performance of each model aggregated across 31 distinct tasks. You can compare performance pre- and post-fine-tuning by selecting the base or fine-tuned model button at the top. Remarkably, most of the fine-tuned open-source models surpass GPT-4, with Llama-3, Phi-3, and Zephyr demonstrating the strongest performance.

Interactive leaderboard (columns: Developer, Model, Performance)

About the Report

Why

The most common questions we hear are: does fine-tuning work, and which model should I use? With this report, we seek to answer these questions and others, such as which tasks are best suited for fine-tuning. We also aim to provide AI teams with a cost-effective framework for productionizing open-source LLMs.

What

The Fine-tuning Index contains a series of interactive tables and charts, beginning with our Model Fine-tuning Leaderboard. The Leaderboard compares the performance of GPT-4 and popular open-source models that we fine-tuned across a series of tasks. We included 10 additional key findings to help teams improve their fine-tuning efforts.

How

We analyzed 700+ fine-tuning experiments using 13 of the most popular open-source models and 31 distinct datasets and tasks. We chose models with at most 7B parameters to ensure that any organization can train them on low-end GPUs. We used accuracy scores, ROUGE metrics, and HumanEval to assess performance.
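As a point of reference, here is a minimal sketch of computing one of these metrics, ROUGE, with the Hugging Face evaluate library. The predictions and references are placeholder strings, and this is not our exact evaluation harness; see the arXiv paper for the full methodology, including accuracy and HumanEval scoring.

```python
# Illustrative only: scoring generated outputs against references with ROUGE.
# pip install evaluate rouge_score
import evaluate

rouge = evaluate.load("rouge")

# Placeholder model outputs and gold references (hypothetical examples).
predictions = ["The contract terminates after 12 months unless renewed."]
references = ["The agreement ends after twelve months unless it is renewed."]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```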

Key Takeaways

LoRA Fine-tuned models outperform GPT-4 on specialized tasks

Fine-tuned models serve as experts within their specific domains, surpassing GPT-4 on 85% of the tasks we tested. On average, fine-tuning yielded an impressive improvement of 25-50%.
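For readers new to the technique, the sketch below shows what attaching LoRA adapters to a small base model looks like with Hugging Face PEFT. The base model ID and LoRA hyperparameters are illustrative assumptions, not the exact settings used in our experiments.

```python
# Illustrative sketch: wrapping a ~7B base model with LoRA adapters via PEFT.
# pip install transformers peft
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "mistralai/Mistral-7B-v0.1"  # assumption: any small base model works similarly
model = AutoModelForCausalLM.from_pretrained(base_id)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Hypothetical LoRA settings; tune rank, alpha, and target modules per task.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trained
```

Because only the adapter weights are updated during training, the per-task training footprint stays small, which is what keeps fine-tuning runs this cheap.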

Average model quality across 31 distinct tasks

The fine-tuned models outperformed GPT-4 by a wide margin on nearly all the tasks, as shown below. The detailed analysis section provides a deeper dive into individual model performance by task.

Difference in Task Performance: Best Fine-tuned Adapter vs. GPT-4

LoRA Fine-tuned models are fast and cheap to train and serve

As part of our benchmarks, we launched LoRA Land, an interactive web app that allows users to query 25+ fine-tuned Mistral adapters that outperform GPT-4. The LoRA Land models were fine-tuned for $8 each on Predibase and served on a single GPU with open-source LoRAX.

LoRA Land enabled us to benchmark the performance and cost of serving LoRA fine-tuned models in a production environment. Not only is serving fine-tuned LLMs on Predibase significantly cheaper than GPT-4, it is also faster, with little-to-no impact on request time when scaling to many models and users, as shown below. You can read more about our analysis in the arXiv paper. GPT-4 benchmarks were pulled from Artificial Analysis.

Chart: Time to First Token (TTFT), Predibase vs. OpenAI
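For context, time to first token can be measured client-side by timing a streaming request until the first chunk arrives. The sketch below illustrates the idea; the endpoint URL and payload shape are placeholder assumptions rather than the setup used for the benchmark above.

```python
# Illustrative TTFT measurement: time from sending a request to receiving
# the first streamed byte of the response. Endpoint and payload are placeholders.
import time
import requests

ENDPOINT = "http://localhost:8080/generate_stream"  # assumption: a TGI/LoRAX-style streaming route
payload = {"inputs": "Summarize: ...", "parameters": {"max_new_tokens": 64}}

start = time.perf_counter()
with requests.post(ENDPOINT, json=payload, stream=True, timeout=60) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        if chunk:  # first non-empty chunk roughly marks the first token arriving
            ttft = time.perf_counter() - start
            print(f"TTFT: {ttft * 1000:.1f} ms")
            break
```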

As mentioned above, we also compared the cost of GPT-4 to serving your own fine-tuned adapters on dedicated deployments with open-source LoRAX and Predibase. LoRAX enables you to cost-effectively serve many adapters on a single GPU, making it possible to build your own GPT-4 with a series of fine-tuned task-specific models.
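As a rough illustration of that pattern, the sketch below queries a running LoRAX deployment and selects a different fine-tuned adapter per request. The adapter IDs are hypothetical, and the request shape follows LoRAX's text-generation-inference-style /generate API; consult the LoRAX docs for the exact parameters supported by your version.

```python
# Illustrative sketch: routing requests to different LoRA adapters on one
# LoRAX deployment by passing an adapter_id per request. Adapter IDs are hypothetical.
import requests

LORAX_URL = "http://localhost:8080/generate"

def generate(prompt: str, adapter_id: str | None = None, max_new_tokens: int = 128) -> str:
    params = {"max_new_tokens": max_new_tokens}
    if adapter_id:
        params["adapter_id"] = adapter_id  # selects which fine-tuned adapter handles the request
    resp = requests.post(LORAX_URL, json={"inputs": prompt, "parameters": params}, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

# Same base model, different task-specific adapters (names are placeholders).
print(generate("Classify this clinical note: ...", adapter_id="my-org/medical-classifier-lora"))
print(generate("Review this NDA clause: ...", adapter_id="my-org/contract-review-lora"))
```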

Chart: Total Inference Costs, GPT-4 vs. Predibase Dedicated Deployment (0.25 A100)

Choosing the Right Tasks for Fine-tuning

Fine-tuning Task Leaderboard

The task leaderboard shows the performance of each fine-tuned model and GPT-4 on the selected task. Use this chart to determine which model performs best for a specific task. Remarkably, most of the fine-tuned open-source models surpass GPT-4, with the Llama-3 series most frequently demonstrating the strongest performance.

Interactive task leaderboard (columns: Developer, Model, Performance)

Specialized tasks are ideal for LoRA fine-tuning

Our analysis showed that it’s more difficult to beat GPT-4 on broad tasks. For example, the tasks where fine-tuned models performed worst were very wide in scope: generating code across a large variety of programming languages like Python, Java, and R, or answering multiple-choice questions spanning dozens of topics.

Conversely, we experienced great success when fine-tuning open-source LLMs for narrowly focused or domain-specific tasks—such as legal contract review, medical text classification, and domain-specific code generation like a company’s internal codebase. As such, it requires less effort to outperform GPT-4 when fine-tuning for a specific task vs. a generalized use case.

Chart: Difference between GPT-4 and the highest-performing fine-tuned model by category, ordered by difficulty (harder to beat GPT-4 to easier to beat GPT-4)

Choosing the Best Base Model for Fine-tuning

When selecting your base model, it’s important to consider performance both at the task level and in aggregate across many tasks. This is especially important when using open-source frameworks like LoRAX that allow you to serve multiple fine-tuned adapters on a single base model: getting the most out of one model is ideal.

The Llama models lead the pack

In terms of architecture, our findings indicate that the Llama series leads the pack on average, with Phi, Zephyr, and Mistral following closely behind. Interestingly, Mistral is built partially on the Llama architecture, and Zephyr is a derivative of the original Mistral Instruct model. This suggests that the Llama and Mistral model series are well suited for adapting to smaller, task-specific jobs.

Top 5 Fine-tuned Models by Average Score

Furthermore, when looking at the frequency of top performance, the Llama-3 series stands out, reinforcing its position as the leading architecture for fine-tuning.

Most frequent top-performing fine-tuned models

Choosing Your Data for Fine-tuning

Specialized datasets yield the best results

During our evaluations, we explored the impact of data complexity on fine-tuning results and discovered a few moderate correlations. Looking at lift over the base model, we found a positive correlation with input length. Conversely, looking at lift over GPT-4, we found moderate negative correlations with output length. This suggests that narrower, easier tasks are more likely to see success with fine-tuned adapters.

Dataset statistic     Base model lift   GPT-4 lift
Base model score       0.113            -0.164
Input length μ         0.351            -0.057
Input length σ         0.300            -0.104
Output length μ       -0.188            -0.462
Output length σ       -0.199            -0.469
Example length μ       0.279            -0.188
Example length σ       0.289            -0.122
# examples             0.275            -0.134
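For readers who want to reproduce this kind of analysis on their own tasks, the sketch below computes Pearson correlations between dataset statistics and lift with pandas. The data frame contents and column names are hypothetical placeholders, not our actual results.

```python
# Illustrative sketch: correlating dataset statistics with fine-tuning lift.
# Column names and values are hypothetical placeholders for per-task results.
import pandas as pd

tasks = pd.DataFrame({
    "input_length_mean": [812, 145, 2300, 560],
    "output_length_mean": [45, 210, 12, 96],
    "num_examples": [8000, 24000, 3500, 11000],
    "base_model_lift": [0.31, 0.12, 0.44, 0.27],   # fine-tuned score minus base score
    "gpt4_lift": [0.09, -0.04, 0.15, 0.02],        # fine-tuned score minus GPT-4 score
})

stats = ["input_length_mean", "output_length_mean", "num_examples"]
correlations = tasks[stats + ["base_model_lift", "gpt4_lift"]].corr(method="pearson")
print(correlations[["base_model_lift", "gpt4_lift"]].loc[stats])
```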

Small data is all you need: <10,000 examples

Unlike initial model training, fine-tuning can be effective with a smaller number of examples. Although some of our datasets were extensive, the majority contained fewer than 20,000 examples. In fact, over 40% contained fewer than 10,000 data points.

Distribution of Training Set Sizes

Learn More

We hope you enjoyed the Fine-tuning Index and find it helpful as you train your own task-specific LLMs. We plan to update it with new models, so check back periodically.

All of the models were fine-tuned for less than $8 and served in production using Predibase and LoRAX. Below are some resources to help you fine-tune models that outperform GPT-4.

Ready to efficiently fine-tune and serve your own LLM?