The Fine-tuning Leaderboard

Detailed performance benchmarks for the most popular small language models, fine-tuned across 30 tasks and compared against commercial models like GPT-4

About the Fine-Tuning Leaderboard

Why

Developers are no longer limited to costly commercial models like GPT-4. High-performing SLMs offer an attractive alternative, enabling organizations to reduce costs and own their model IP. Through fine-tuning, smaller open-source models can deliver GPT-4-level performance on specific tasks. This leaderboard helps enterprise AI teams select the best open-source model for their applications and needs.

What

The Fine-Tuning Leaderboard is a comprehensive resource that showcases the performance of the latest Small Language Models (SLMs) like Llama 3.1, Solar 10.7B, and Mistral. It compares these models against each other and against OpenAI baselines across 30 diverse tasks.

How

We conducted over 700 LLM fine-tuning experiments in April 2024 and documented the results in our arXiv research paper. We continually update the Fine-Tuning Leaderboard with new SLMs, evaluating them across the same 30 diverse tasks. The results are published in the comparative table below, providing insight into the evolving landscape of open-source language models.
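
To make the table's two headline columns concrete, here is a minimal sketch of the underlying arithmetic, assuming one normalized score per task: a model's average score is the mean of its 30 per-task scores, and its improvement over GPT-4 is the relative gain over GPT-4's 0.655 average. The helper functions are hypothetical, not Predibase's evaluation code.

```python
# Sketch of the leaderboard arithmetic (hypothetical helpers, not
# Predibase's actual evaluation code). Each score is assumed to be a
# single quality metric per task, normalized to [0, 1].

GPT4_AVG = 0.655  # GPT-4's average score across the 30 datasets (see table)

def average_score(per_task_scores: list[float]) -> float:
    """Mean of a model's scores over the 30 evaluation tasks."""
    return sum(per_task_scores) / len(per_task_scores)

def improvement_over_gpt4(avg_score: float, gpt4_avg: float = GPT4_AVG) -> float:
    """Relative improvement over GPT-4, as shown in the table's last column."""
    return (avg_score - gpt4_avg) / gpt4_avg

# Sanity check against the table's top row: fine-tuned
# Llama-3.3-70B-Instruct averages 0.810, and
# (0.810 - 0.655) / 0.655 ≈ 0.2366, i.e. the 23.66% reported.
print(f"{improvement_over_gpt4(0.810):.2%}")  # 23.66%
```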

| Model Name | Average Score (Fine-Tuned, 30 datasets) | Average Score (Base Model, 30 datasets) | Parameters | Family | Alignment | % Improvement over GPT-4 |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-3.3-70B-Instruct | 0.810 | 0.559 | 70.6B | Llama | Instruct | 23.66% |
| Solar Pro Preview | 0.798 | 0.382 | 22.1B | Solar | Instruct | 21.83% |
| Llama-3.1-8B-Instruct | 0.782 | NaN | 8.03B | Llama | Instruct | 19.39% |
| Solar Mini | 0.781 | 0.543 | 10.8B | Solar | Instruct | 19.24% |
| Llama-3.1-8B | 0.769 | NaN | 8.03B | Llama | Pretrained | 17.40% |
| Phi-3-mini-4k-instruct | 0.768 | 0.523 | 3.82B | Phi | Instruct | 17.25% |
| Llama-3-8B-Instruct | 0.761 | NaN | 8B | Llama | Instruct | 16.18% |
| GPT-4o-mini | 0.757 | 0.236 | N/A | OpenAI | Pretrained | 15.57% |
| HuggingFaceH4/zephyr-7b-beta | 0.756 | 0.360 | 7B | Zephyr | Instruct | 15.42% |
| Llama-3.2-3B-Instruct | 0.746 | 0.438 | 3.21B | Llama | Instruct | 13.89% |
| Mistral-7B-v0.1 | 0.745 | 0.273 | 7B | Mistral | Pretrained | 13.74% |
| Llama-3-8B | 0.743 | NaN | 8B | Llama | Pretrained | 13.44% |
| Mistral-7B-Instruct-v0.1 | 0.737 | 0.472 | 7B | Mistral | Instruct | 12.52% |
| Llama-3.2-3B | 0.731 | 0.305 | 3.21B | Llama | Pretrained | 11.60% |
| Llama-2-7b-chat-hf | 0.726 | 0.378 | 7B | Llama | Instruct | 10.84% |
| Mistral-7B-Instruct-v0.3 | 0.722 | 0.509 | 7B | Mistral | Instruct | 10.23% |
| Mistral-7B-v0.3 | 0.720 | 0.307 | 7B | Mistral | Pretrained | 9.92% |
| Llama-2-7b-hf | 0.715 | 0.260 | 7B | Llama | Pretrained | 9.16% |
| phi-2 | 0.687 | 0.282 | 2.7B | Phi | Pretrained | 4.89% |
| gemma-2b | 0.677 | 0.149 | 2B | Gemma | Pretrained | 3.36% |
| Llama-3.2-1B | 0.677 | 0.276 | 1.24B | Llama | Pretrained | 3.36% |
| Llama-3.2-1B-Instruct | 0.676 | 0.348 | 1.24B | Llama | Instruct | 3.21% |
| gemma-7b-it | 0.667 | 0.389 | 7B | Gemma | Instruct | 1.83% |
| gemma-2b-it | 0.661 | 0.336 | 2B | Gemma | Instruct | 0.92% |
| gemma-2-27b-it | 0.654 | 0.382 | 27.2B | Gemma | Instruct | -0.15% |
| gemma-7b | 0.652 | 0.192 | 7B | Gemma | Pretrained | -0.46% |
| Phi-3.5-mini-instruct | 0.578 | 0.431 | 3.82B | Phi | Instruct | -11.76% |
| gemma-2-27b | 0.534 | 0.236 | 27.2B | Gemma | Pretrained | -18.47% |
| GPT-3.5-Turbo | -- | 0.596 | 175B | OpenAI | Pretrained | -9.01% |
| GPT-4 | -- | 0.655 | 1760B | OpenAI | Pretrained | 0.00% |

Learn More

We hope you enjoyed the Fine-Tuning Leaderboard and find it helpful as you train your own task-specific LLMs. We plan to update it with new models, so check back periodically.

All of the models were fine-tuned for less than $8 and served in production using Predibase and LoRAX. Below are some resources to help you fine-tune models that outperform GPT-4.
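
For a feel for the mechanics, here is a minimal LoRA fine-tuning sketch using the open-source Hugging Face transformers and peft libraries. It is illustrative only: the leaderboard runs were performed on Predibase, and the base model, rank, and target modules below are assumptions chosen for the example, not the exact configuration used.

```python
# Illustrative LoRA setup with Hugging Face transformers + peft.
# Not the Predibase configuration; hyperparameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "mistralai/Mistral-7B-v0.1"  # one of the leaderboard base models
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# LoRA freezes the base weights and trains small low-rank adapter
# matrices, which is why per-model fine-tuning can cost only a few dollars.
lora_config = LoraConfig(
    r=8,                                  # adapter rank
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the 7B weights
```

Adapters like this are what LoRAX is built to serve: many small task-specific LoRA adapters can be loaded onto a single copy of the base model at inference time, which keeps serving costs low.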

Ready to efficiently fine-tune and serve your own LLM?