The Fine-tuning Leaderboard
Detailed performance benchmarks for the most popular small language models fine-tuned across 30 tasks and compared to commercial models like GPT-4
About the Fine-tuning Leaderboard
Why
Developers are no longer limited to costly commercial models like GPT-4. High-performing SLMs offer an attractive alternative, enabling organizations to reduce costs and own their model IP. Through fine-tuning, smaller open-source models can deliver GPT-4-level performance. This leaderboard helps enterprise AI teams select the best open-source model for their specific applications and needs.
What
The Fine-tuning Leaderboard is a comprehensive resource showcasing the performance of the latest small language models (SLMs), such as Llama 3.1, Solar 10.7B, and Mistral. It compares these models against each other and against OpenAI baselines across 30 diverse tasks.
How
We conducted over 700 LLM fine-tuning experiments in April 2024 and documented the results in our arXiv research paper. We continually update the Fine-tuning Leaderboard with new SLMs, evaluating each across the same 30 diverse tasks. The results are published in the comparative table below, providing insight into the evolving landscape of language models.
Model Name | Average Score (Fine-tuned, 30 datasets) | Average Score (Base Model, 30 datasets) | Number of Parameters | Family | Alignment | % Improvement over GPT-4 |
---|---|---|---|---|---|---|
Solar Pro Preview | 0.798 | 0.382 | 22.1B | Solar | Pretrained | 21.83% |
Llama-3.1-8B-Instruct | 0.782 | 0.461 | 8.03B | Llama | Instruct | 19.39% |
Solar Mini | 0.781 | 0.543 | 10.8B | Solar | Pretrained | 19.24% |
Llama-3.1-8B | 0.769 | 0.312 | 8.03B | Llama | Pretrained | 17.40% |
Phi-3-mini-4k-instruct | 0.768 | 0.523 | 3.82B | Phi | Instruct | 17.25% |
Llama-3-8B-Instruct | 0.761 | 0.441 | 8B | Llama | Instruct | 16.18% |
GPT-4o-mini | 0.757 | 0.236 | N/A | OpenAI | Pretrained | 15.57% |
HuggingFaceH4/zephyr-7b-beta | 0.756 | 0.360 | 7B | Zephyr | Instruct | 15.42% |
Llama-3.2-3B-Instruct | 0.746 | 0.438 | 3.21B | Llama | Instruct | 13.89% |
Mistral-7B-v0.1 | 0.745 | 0.273 | 7B | Mistral | Pretrained | 13.74% |
Llama-3-8B | 0.743 | 0.276 | 8B | Llama | Pretrained | 13.44% |
Mistral-7B-Instruct-v0.1 | 0.737 | 0.472 | 7B | Mistral | Instruct | 12.52% |
Llama-3.2-3B | 0.731 | 0.305 | 3.21B | Llama | Pretrained | 11.60% |
Llama-2-7b-chat-hf | 0.726 | 0.378 | 7B | Llama | Instruct | 10.84% |
Mistral-7B-Instruct-v0.3 | 0.722 | 0.509 | 7B | Mistral | Instruct | 10.23% |
Mistral-7B-v0.3 | 0.720 | 0.307 | 7B | Mistral | Pretrained | 9.92% |
Llama-2-7b-hf | 0.715 | 0.260 | 7B | Llama | Pretrained | 9.16% |
phi-2 | 0.687 | 0.282 | 2.7B | Phi | Pretrained | 4.89% |
gemma-2b | 0.677 | 0.149 | 2B | Gemma | Pretrained | 3.36% |
Llama-3.2-1B | 0.677 | 0.276 | 1.24B | Llama | Pretrained | 3.36% |
Llama-3.2-1B-Instruct | 0.676 | 0.348 | 1.24B | Llama | Instruct | 3.21% |
gemma-7b-it | 0.667 | 0.389 | 7B | Gemma | Pretrained | 1.83% |
gemma-2b-it | 0.661 | 0.336 | 2B | Gemma | Pretrained | 0.92% |
gemma-2-27b-it | 0.654 | 0.382 | 27.2B | Gemma | Instruct | -0.15% |
gemma-7b | 0.652 | 0.192 | 7B | Gemma | Pretrained | -0.46% |
Phi-3.5-mini-instruct | 0.578 | 0.431 | 3.82B | Phi | Pretrained | -11.76% |
gemma-2-27b | 0.534 | 0.236 | 27.2B | Gemma | Pretrained | -18.47% |
GPT-3.5-Turbo | -- | 0.596 | 175B | OpenAI | Pretrained | -9.01% |
GPT-4 | -- | 0.655 | 1760B | OpenAI | Pretrained | 0.00% |
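The final column measures each model's fine-tuned average score against GPT-4's base average of 0.655 on the same 30 datasets. The figures in the table are consistent with a simple relative-improvement calculation; the short sketch below reproduces a few rows under that assumption, using the baseline constant and example scores taken straight from the table.

```python
# How the "% Improvement over GPT-4" column can be reproduced: each fine-tuned
# average is compared to GPT-4's base average (0.655) on the same 30 datasets.
GPT4_BASELINE = 0.655  # GPT-4 average score, base model, 30 datasets

def improvement_over_gpt4(avg_score: float, baseline: float = GPT4_BASELINE) -> float:
    """Percentage improvement of a model's average score relative to GPT-4."""
    return (avg_score - baseline) / baseline * 100

print(f"Solar Pro Preview:     {improvement_over_gpt4(0.798):+.2f}%")  # +21.83%
print(f"Llama-3.1-8B-Instruct: {improvement_over_gpt4(0.782):+.2f}%")  # +19.39%
print(f"GPT-3.5-Turbo (base):  {improvement_over_gpt4(0.596):+.2f}%")  # -9.01%
```

For GPT-3.5-Turbo and GPT-4 themselves, the comparison uses their base-model scores, since no fine-tuned score is reported for those rows.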
Learn More
We hope you enjoyed the Fine-tuning Leaderboard and find it helpful as you train your own task-specific LLMs. We plan to update it with new models, so check back periodically.
All of the models were fine-tuned for less than $8 and served in production using Predibase and LoRAX; a minimal sketch of the LoRA fine-tuning approach is shown below. Beneath it are some resources to help you fine-tune models that outperform GPT-4.
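For readers who want a feel for what one of these runs involves, here is a minimal LoRA fine-tuning sketch using the open-source Hugging Face transformers and peft libraries. The base model, dataset file, and hyperparameters are illustrative assumptions, not the exact Predibase configuration behind the leaderboard numbers.

```python
# Minimal LoRA fine-tuning sketch with Hugging Face transformers + peft.
# The base model, dataset file, and hyperparameters are illustrative
# placeholders, not the exact recipe used for the leaderboard results.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-3.1-8B-Instruct"  # any SLM from the table
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Attach LoRA adapters to the attention projections; rank/alpha are common defaults.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the adapter weights are trainable

# Hypothetical task dataset with a "text" field already formatted as prompts.
dataset = load_dataset("json", data_files="task_train.jsonl", split="train")
tokenized = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-out",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-4,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("lora-out")  # writes only the small adapter, not the full model
```

Because only the low-rank adapter weights are updated, a run like this touches a tiny fraction of the model's parameters, which is what keeps per-task training costs so low.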
Read the arXiv Paper
Take a deep dive into all of the technical details and results from our 300+ fine-tuning experiments in our research paper.
Watch the Webinar
Watch this on-demand technical session to learn how we fine-tuned LLMs to rival GPT-4 and the lessons we learned along the way.
Free Fine-tuning Credits
Sign up for $25 in free credits on Predibase to fine-tune and serve your own models in production.