The Fine-tuning Leaderboard

Detailed performance benchmarks for the most popular small language models fine-tuned across 30 tasks and compared to commercial models like GPT-4

Try Predibase Speak to an SLM Expert

About the Fine-Tuning Leaderboard

Why

Developers are no longer constrained to costly commercial models like GPT-4. High-performing SLMs offer an attractive alternative, enabling organizations to reduce costs and own model IP. Through fine-tuning, smaller open-source models can deliver GPT-like performance. This leaderboard helps enterprise AI teams select the best open-source model for their specific applications and needs.

What

The Fine-Tuning Leaderboard is a comprehensive resource that showcases the performance of the latest Small Language Models (SLMs) like Llama 3.1, Solar 10.7B, and Mistral. It compares these models against each other and OpenAI baselines across 30 diverse tasks.

How

We conducted over 700 LLM fine-tuning experiments in April 2024 and documented the results in our Arxiv research paper. We continually update the Fine-Tuning Leaderboard with new SLMs, evaluating them across 30 diverse tasks. The results are published in a comparative table, providing valuable insights into the evolving landscape of language models.

Category

Task(s)

# of Parameters (B)

Family

Alignment

Model Name

Model Name	Average Score(Fine-Tuned, 30 datasets)	Average Score(Base Model, 30 datasets)	Number of Parameters	Family	Alignment	% Improvement over GPT4
Llama-3.3-70B-Instruct	0.810	0.559	70.6B	Llama	Instruct	23.66%
Solar Pro Preview	0.798	0.382	22.1B	Solar	Instruct	21.83%
Llama-3.1-8B-Instruct	0.782	NaN	8.03B	Llama	Instruct	19.39%
Solar Mini	0.781	0.543	10.8B	Solar	Instruct	19.24%
Llama-3.1-8B	0.769	NaN	8.03B	Llama	Pretrained	17.40%
Phi-3-mini-4k-instruct	0.768	0.523	3.82B	Phi	Instruct	17.25%
Llama-3-8B-Instruct	0.761	NaN	8B	Llama	Instruct	16.18%
GPT-4o-mini	0.757	0.236	N/A	OpenAI	Pretrained	15.57%
HuggingFaceH4/zephyr-7b-beta	0.756	0.360	7B	Zephyr	Instruct	15.42%
Llama-3.2-3B-Instruct	0.746	0.438	3.21B	Llama	Instruct	13.89%
Mistral-7B-v0.1	0.745	0.273	7B	Mistral	Pretrained	13.74%
Llama-3-8B	0.743	NaN	8B	Llama	Pretrained	13.44%
Mistral-7B-Instruct-v0.1	0.737	0.472	7B	Mistral	Instruct	12.52%
Llama-3.2-3B	0.731	0.305	3.21B	Llama	Pretrained	11.60%
Llama-2-7b-chat-hf	0.726	0.378	7B	Llama	Instruct	10.84%
Mistral-7B-Instruct-v0.3	0.722	0.509	7B	Mistral	Instruct	10.23%
Mistral-7B-v0.3	0.720	0.307	7B	Mistral	Pretrained	9.92%
Llama-2-7b-hf	0.715	0.260	7B	Llama	Pretrained	9.16%
phi-2	0.687	0.282	2.7B	Phi	Pretrained	4.89%
gemma-2b	0.677	0.149	2B	Gemma	Pretrained	3.36%
Llama-3.2-1B	0.677	0.276	1.24B	Llama	Pretrained	3.36%
Llama-3.2-1B-Instruct	0.676	0.348	1.24B	Llama	Instruct	3.21%
gemma-7b-it	0.667	0.389	7B	Gemma	Instruct	1.83%
gemma-2b-it	0.661	0.336	2B	Gemma	Instruct	0.92%
gemma-2-27b-it	0.654	0.382	27.2B	Gemma	Instruct	-0.15%
gemma-7b	0.652	0.192	7B	Gemma	Pretrained	-0.46%
Phi-3.5-mini-instruct	0.578	0.431	3.82B	Phi	Instruct	-11.76%
gemma-2-27b	0.534	0.236	27.2B	Gemma	Pretrained	-18.47%
GPT-3.5-Turbo	--	0.596	175B	OpenAI	Pretrained	-9.01%
GPT-4	--	0.655	1760B	OpenAI	Pretrained	0.00%

Learn More

We hope you enjoyed the Fine-tuning Index and find it helpful as you train your own task-specific LLMs. We plan to update with new models so check back periodically.

All of the models were fine-tuned for less than $8 and served in production using Predibase and LoRAX. Below are some resources to help you fine-tune models that outperform GPT-4.

Ready to efficiently fine-tune and serve your own LLM?

Try Predibase for Free