The Fine-tuning Leaderboard

Detailed performance benchmarks for the most popular small language models, fine-tuned across 30 tasks and compared against commercial models like GPT-4.

About the Fine-Tuning Leaderboard

Why

Developers are no longer limited to costly commercial models like GPT-4. High-performing small language models (SLMs) offer an attractive alternative, enabling organizations to reduce costs and own their model IP. Through fine-tuning, smaller open-source models can deliver GPT-4-like performance. This leaderboard helps enterprise AI teams select the best open-source model for their specific applications and needs.

What

The Fine-Tuning Leaderboard is a comprehensive resource that showcases the performance of the latest SLMs, such as Llama 3.1, Solar 10.7B, and Mistral. It compares these models against each other and against OpenAI baselines across 30 diverse tasks.

How

We conducted over 700 LLM fine-tuning experiments in April 2024 and documented the results in our arXiv research paper. We continually update the Fine-Tuning Leaderboard with new SLMs, evaluating them across the same 30 diverse tasks. The results are published in the comparative table below, providing insight into the evolving landscape of language models.
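
For reference, here is a minimal sketch of what one such fine-tuning run might look like using Hugging Face Transformers and PEFT with LoRA adapters. The base model name, dataset file, and hyperparameters are illustrative placeholders, not the exact configuration used for the leaderboard experiments.

```python
# Minimal sketch of a single LoRA fine-tuning run in the spirit of the
# leaderboard experiments. Model, dataset file, and hyperparameters are
# illustrative assumptions, not the exact leaderboard configuration.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "mistralai/Mistral-7B-v0.1"  # any SLM from the table below

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")

# Attach low-rank adapters; only the adapter weights are updated during training.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

# Task-specific training data with a "text" column (hypothetical file name).
dataset = load_dataset("json", data_files="task_train.jsonl", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lora-out",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-4,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()

# Save only the small LoRA adapter, which can be loaded at serving time.
model.save_pretrained("lora-out/adapter")
```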

| Model Name | Average Score (30 datasets) | Fine-Tuned / Base Model | Number of Parameters | Family | Alignment | % Improvement over GPT-4 |
|---|---|---|---|---|---|---|
| Solar Pro Preview | 0.798 | Fine-Tuned | 22.1B | Solar | Instruct | 21.83% |
| Meta-Llama-3.1-8B-Instruct | 0.782 | Fine-Tuned | 8.03B | Llama | Instruct | 19.39% |
| Solar Mini | 0.781 | Fine-Tuned | 10.8B | Solar | Instruct | 19.24% |
| Meta-Llama-3-8B | 0.771 | Fine-Tuned | 8B | Llama | Pretrained | 17.71% |
| Meta-Llama-3.1-8B | 0.769 | Fine-Tuned | 8.03B | Llama | Instruct | 17.40% |
| Phi-3-mini-4k-instruct | 0.768 | Fine-Tuned | 3.82B | Phi | Instruct | 17.25% |
| Meta-Llama-3-8B-Instruct | 0.761 | Fine-Tuned | 8B | Llama | Instruct | 16.18% |
| HuggingFaceH4/zephyr-7b-beta | 0.756 | Fine-Tuned | 7B | Zephyr | Instruct | 15.42% |
| Mistral-7B-v0.1 | 0.745 | Fine-Tuned | 7B | Mistral | Pretrained | 13.74% |
| Mistral-7B-Instruct-v0.1 | 0.737 | Fine-Tuned | 7B | Mistral | Instruct | 12.52% |
| Llama-2-7b-chat-hf | 0.726 | Fine-Tuned | 7B | Llama | Instruct | 10.84% |
| Mistral-7B-Instruct-v0.3 | 0.722 | Fine-Tuned | 7B | Mistral | Instruct | 10.23% |
| Mistral-7B-v0.3 | 0.720 | Fine-Tuned | 7B | Mistral | Pretrained | 9.92% |
| Llama-2-7b-hf | 0.715 | Fine-Tuned | 7B | Llama | Pretrained | 9.16% |
| phi-2 | 0.687 | Fine-Tuned | 2.7B | Phi | Pretrained | 4.89% |
| gemma-2b | 0.677 | Fine-Tuned | 2B | Gemma | Pretrained | 3.36% |
| gemma-7b-it | 0.667 | Fine-Tuned | 7B | Gemma | Instruct | 1.83% |
| gemma-2b-it | 0.661 | Fine-Tuned | 2B | Gemma | Instruct | 0.92% |
| GPT-4 | 0.655 | Base Model | 1760B | OpenAI | Pretrained | 0.00% |
| gemma-2-27b-it | 0.654 | Fine-Tuned | 27.2B | Gemma | Instruct | -0.15% |
| gemma-7b | 0.652 | Fine-Tuned | 7B | Gemma | Pretrained | -0.46% |
| GPT-4o-mini | 0.617 | Base Model | 8B | OpenAI | Pretrained | -5.80% |
| GPT-3.5-Turbo | 0.596 | Base Model | 175B | OpenAI | Pretrained | -9.01% |
| Phi-3.5-mini-instruct | 0.578 | Fine-Tuned | 3.82B | Phi | Pretrained | -11.76% |
| Solar Mini | 0.543 | Base Model | 10.8B | Solar | Instruct | -17.10% |
| gemma-2-27b | 0.534 | Fine-Tuned | 27.2B | Gemma | Pretrained | -18.47% |
| Phi-3-mini-4k-instruct | 0.523 | Base Model | 3.82B | Phi | Instruct | -20.15% |
| Mistral-7B-Instruct-v0.3 | 0.509 | Base Model | 7B | Mistral | Instruct | -22.29% |
| Mistral-7B-Instruct-v0.1 | 0.472 | Base Model | 7B | Mistral | Instruct | -27.94% |
| Meta-Llama-3.1-8B-Instruct | 0.461 | Base Model | 8.03B | Llama | Instruct | -29.62% |
| Meta-Llama-3-8B-Instruct | 0.441 | Base Model | 8B | Llama | Instruct | -32.67% |
| Phi-3.5-mini-instruct | 0.431 | Base Model | 3.82B | Phi | Pretrained | -34.20% |
| gemma-7b-it | 0.389 | Base Model | 7B | Gemma | Instruct | -40.61% |
| gemma-2-27b-it | 0.382 | Base Model | 27.2B | Gemma | Instruct | -41.68% |
| Solar Pro Preview | 0.379 | Base Model | 22.1B | Solar | Instruct | -42.14% |
| Llama-2-7b-chat-hf | 0.378 | Base Model | 7B | Llama | Instruct | -42.29% |
| HuggingFaceH4/zephyr-7b-beta | 0.360 | Base Model | 7B | Zephyr | Instruct | -45.04% |
| gemma-2b-it | 0.336 | Base Model | 2B | Gemma | Instruct | -48.70% |
| Meta-Llama-3.1-8B | 0.312 | Base Model | 8.03B | Llama | Instruct | -52.37% |
| Mistral-7B-v0.3 | 0.307 | Base Model | 7B | Mistral | Pretrained | -53.13% |
| phi-2 | 0.282 | Base Model | 2.7B | Phi | Pretrained | -56.95% |
| Meta-Llama-3-8B | 0.276 | Base Model | 8B | Llama | Pretrained | -57.86% |
| Mistral-7B-v0.1 | 0.273 | Base Model | 7B | Mistral | Pretrained | -58.32% |
| Llama-2-7b-hf | 0.260 | Base Model | 7B | Llama | Pretrained | -60.31% |
| gemma-2-27b | 0.236 | Base Model | 27.2B | Gemma | Pretrained | -63.97% |
| gemma-7b | 0.192 | Base Model | 7B | Gemma | Pretrained | -70.69% |
| gemma-2b | 0.149 | Base Model | 2B | Gemma | Pretrained | -77.25% |
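
The % Improvement over GPT-4 column appears to be the relative difference between a model's average score and GPT-4's baseline average of 0.655. A minimal sketch of that calculation, with the formula inferred from the table values:

```python
def improvement_over_gpt4(score, gpt4_score=0.655):
    """Relative improvement over the GPT-4 baseline average score, in percent."""
    return (score - gpt4_score) / gpt4_score * 100

# Example: fine-tuned Solar Pro Preview (0.798) vs. GPT-4 (0.655)
print(f"{improvement_over_gpt4(0.798):.2f}%")  # ~21.83%, matching the table
```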

Learn More

We hope you enjoy the Fine-Tuning Leaderboard and find it helpful as you train your own task-specific LLMs. We plan to update it with new models, so check back periodically.

All of the models were fine-tuned for less than $8 and served in production using Predibase and LoRAX. Below are some resources to help you fine-tune models that outperform GPT-4.
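
As an illustration, here is a minimal sketch of prompting a fine-tuned adapter through a running LoRAX deployment, assuming the lorax-client Python package. The endpoint URL, adapter ID, and prompt are hypothetical placeholders.

```python
# Minimal sketch of querying a fine-tuned adapter on a running LoRAX deployment.
# The endpoint URL, adapter ID, and prompt below are hypothetical placeholders.
from lorax import Client

client = Client("http://127.0.0.1:8080")

response = client.generate(
    "Summarize the following support ticket: ...",
    adapter_id="my-org/mistral-7b-task-adapter",  # your fine-tuned LoRA adapter
    max_new_tokens=256,
)
print(response.generated_text)
```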

Ready to efficiently fine-tune and serve your own LLM?