Introducing the Fine-Tuning Index for LLMs

May 21, 2024 · less than a minute read
Will Van Eaton

We’re excited to announce the Fine-Tuning Index. Based on insights from over 700 fine-tuning experiments, the index shows how fine-tuning open-source LLMs significantly boosts their performance in production environments, and it ranks top open-source and commercial LLMs by performance across a variety of tasks.

Designed to assist enterprise AI teams in selecting the most suitable open-source models for their specific needs, the Fine-Tuning Index reports the performance of 13 popular open-source LLMs across 31 diverse tasks, comparing them to leading commercial models like GPT-4.

[Image: Predibase Fine-Tuning Index leaderboard]

In conversations with dozens of teams exploring open-source LLMs to power their GenAI initiatives, we’ve heard that many AI teams are uncertain about which open-source LLM will best suit their tasks. While some models may appear more capable out of the box, the nuanced performance differences between base models and their fine-tuned counterparts have not been thoroughly aggregated and reported—until now. The Fine-Tuning Index empowers teams to select the optimal open-source LLM with greater confidence, reducing time spent on trial and error and accelerating the journey to production with the right fine-tuned model.

Key findings from the Fine-Tuning Index research:

  • Outperformance of GPT-4: The majority of fine-tuned open-source models outperformed GPT-4 and GPT-4o, with Llama 3, Phi-3, and Zephyr leading the way.
  • Cost-Effectiveness: Fine-tuned models were not only more cost-effective but also faster to train and deploy; fine-tuning each LLM for a typical task cost approximately $8 in compute resources, while GPT-4 incurred significantly higher monthly costs for enterprise use cases.
  • Specialized Task Superiority: Fine-tuned LLMs excelled in specialized tasks such as legal contract review and medical classification, outperforming GPT-4 on 85% of tested tasks.
  • Optimal Base Models: The architectures of Llama 3, Phi-3, and Zephyr emerged as top choices for fine-tuning, demonstrating superior performance across various tasks.

We dive deeper into each of these findings in a recently published report by the Predibase research team titled “LoRA Land: 310 Fine-tuned LLMs that Rival GPT-4, A Technical Report.” This report not only showcases the potential of open-source LLMs but also provides actionable insights and tools for organizations aiming to leverage these models effectively. By democratizing access to advanced language models and offering cost-effective solutions, Predibase is paving the way for teams to bring innovative AI products to market.

Check out the Fine-Tuning Index yourself for more details.
