Solar LLM on Predibase: The Best LLM for Fine-Tuning that beats GPT-4

June 13, 2024 · 4 min read

For organizations with domain-specific data and use cases, fine-tuning is one of the most performant and cost-effective ways to tailor LLMs for production applications. By fine-tuning small LLMs for specific tasks, teams can achieve performance that surpasses massive general-purpose models (e.g., GPT-4) for use cases such as:

Upstage + Predibase Use Cases

However, not all LLMs are equally well suited to fine-tuning. Each model was developed with a different design philosophy (e.g., a model built to handle broad, general use cases vs. a model designed to be customized for specific applications), which makes some models better starting points for fine-tuning than others.

Predibase is the fastest and most efficient way to fine-tune and serve task-specific LLMs, and has deep experience fine-tuning an extensive collection of open-source and small proprietary LLMs that make ideal base models.

After running nearly 500 fine-tuning experiments, we can quantifiably demonstrate that Upstage’s Solar LLM is the most capable model for fine-tuning, and we’re excited to announce that Solar LLM is now available for teams to fine-tune and serve on Predibase.

Try Predibase for Free

Introducing Upstage’s Solar LLMs

Why did Upstage build Solar LLM?

Upstage is a leading enterprise AI company that has a proven track record of providing powerful custom document processing / LLM solutions for global enterprises across various industries such as financial services, healthcare, supply chain, and legal. 

With its deep roots in the enterprise AI space, Upstage developed Solar LLM with the belief that for mainstream enterprise adoption, enterprises need a powerful, purpose-trainable LLM solution that can be easily trained with their private data and securely hosted on their premises.  

As a base model, Solar is intentionally kept small and light enough to run on a single GPU, offering strong performance (i.e., accuracy and speed) and cost competitiveness, with the potential for even better performance through fine-tuning.

Upstage + Predibase LLM Design Philosophy

With fine-tuning, Upstage has seen further performance improvements on several tasks, including translation, math solving, and categorization, with results exceeding the performance of GPT-4.

What makes Solar LLM good for fine-tuning?

With further customization in mind, Solar LLM was pre-trained to improve performance on specific downstream tasks through fine-tuning. Specifically, Upstage invested significant effort in optimizing the balance of the datasets used in pre-training and instruction tuning, and evenly regulated the domain distribution to accommodate the varied fine-tuning scenarios enterprises encounter.

This approach differs from other general-purpose LLMs, where fine-tuning may not always result in significant performance improvements, as these models are designed for general use cases.

Predibase's Fine-Tuning and Inference Technology

Predibase is the leading developer platform for fine-tuning and serving LLMs. Built from the ground up to be fast, reliable, and cost-effective, Predibase offers a best-in-class fine-tuning experience. Predibase manages the compute resources required for fine-tuning, so teams don’t need to worry about out-of-memory (OOM) errors and can trust that the right serverless GPU hardware will be used for the job.
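
For context on what that looks like in practice, here is a minimal sketch of launching a fine-tune through the Predibase Python SDK. The exact class and method names (Predibase, FinetuningConfig, pb.adapters.create) and the base-model identifier are assumptions based on the SDK's general shape, so check the current Predibase docs for the authoritative signatures.

```python
# Hypothetical sketch of launching a fine-tuning job via the Predibase Python SDK.
# Class/method names and the base-model identifier are assumptions based on the
# SDK's documented shape; consult the current docs for exact signatures.
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<YOUR_API_TOKEN>")

# Upload a prepared training dataset (e.g., prompt/completion pairs).
dataset = pb.datasets.from_file("train.jsonl", name="my_task_train")

# Create a repository to hold the resulting adapter versions.
repo = pb.repos.create(name="solar-my-task", exists_ok=True)

# Launch a LoRA fine-tune of Solar; Predibase provisions the serverless GPUs,
# so no manual memory or hardware tuning is needed.
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="solar-1-mini-chat-240520"),
    dataset=dataset,
    repo=repo,
    description="Solar fine-tune for my task",
)
```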

Predibase also offers low-latency inference (0.20 seconds to first token) and lightning-fast throughput (200 tokens per second). Plus, with LoRA eXchange (LoRAX), the open-source serving framework developed by Predibase, teams can serve hundreds of fine-tuned LLMs from a single GPU, whether a high-end A100 or H100 or a commodity A10G, making Predibase one of the most cost-effective platforms for serving fine-tuned LLMs.
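
As a concrete sketch of what multi-adapter serving looks like from the client side, the example below posts a prompt to a LoRAX /generate endpoint and routes it to a specific fine-tuned adapter via the adapter_id parameter; the endpoint URL and adapter name are placeholders, not real deployments.

```python
# Minimal sketch of querying a LoRAX deployment. The base model handles the
# request unless "adapter_id" is supplied, in which case LoRAX dynamically
# applies that fine-tuned LoRA adapter on the same shared GPU.
import requests

LORAX_ENDPOINT = "http://localhost:8080/generate"  # placeholder deployment URL

payload = {
    "inputs": "Classify the sentiment of: 'The onboarding flow was painless.'",
    "parameters": {
        "max_new_tokens": 64,
        "adapter_id": "my-org/solar-sentiment-adapter",  # hypothetical adapter name
    },
}

response = requests.post(LORAX_ENDPOINT, json=payload, timeout=60)
print(response.json()["generated_text"])
```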

To evaluate solar-1-mini-chat-240520, we compared its fine-tuned, task-specific performance against 13 popular open-source LLMs of a similar weight class and two closed-source base models: GPT-3.5 Turbo and GPT-4.

High-Level Experimentation Methodology

Here's a brief overview of our experimentation setup:

  1. Dataset Selection: We meticulously selected 31 diverse datasets spanning 5 categories: natural language understanding, coding, knowledge, reasoning, and math.
  2. Dataset Preparation: Each of the 31 datasets was split into a training set and a held-out evaluation set to ensure robust evaluation.
  3. Model Training: We chose a base model and trained it on each of these datasets, utilizing their respective instruct/chat templates. This process was repeated for every base model included in this experiment.
  4. Batch Evaluation: Post-training, we conducted batch evaluations using the fine-tuned LoRA adapters on the held-out evaluation sets. Depending on the task type, we employed metrics such as accuracy, ROUGE, and HumanEval to gauge performance effectively (see the scoring sketch after this list).
  5. Results Comparison: Finally, we compiled the results and performed a comparative analysis of the models to identify the top performers.
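
For the batch-evaluation step, here is a minimal scoring sketch, assuming the fine-tuned adapter's predictions for a held-out set have already been generated. It uses ROUGE via the Hugging Face evaluate library (classification-style tasks would swap in accuracy), and the example strings are made up.

```python
# Sketch of scoring one held-out evaluation set with ROUGE.
# Requires the `evaluate` and `rouge_score` packages.
import evaluate

# Hypothetical example data; in practice these come from the held-out split
# and the adapter's batch-generation outputs.
references = [
    "The model fits on a single GPU.",
    "Fine-tuning improved accuracy on the task.",
]
predictions = [
    "The model runs on one GPU.",
    "Accuracy on the task improved after fine-tuning.",
]

rouge = evaluate.load("rouge")
scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```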

Results

After tabulating all of these results, we found that solar-1-mini-chat-240520 is the strongest-performing model in the ~11B-parameter weight class, outperforming most other open-source models by a significant margin.

Here's a deeper look at two slices of the metrics that serve as supporting evidence for the observation above.

Slice 1: Overall Performance of Solar Fine-Tunes

This metric quantifies how often a specific model attains the highest score compared to all other models for a given task. This frequency is summed across all 31 tasks to assess the overall effectiveness of each model. In other words, it measures the number of times model X outperforms its peers across all tasks.
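
In code, this win count is simply a per-task argmax over model scores, tallied across tasks; the sketch below uses made-up placeholder scores rather than the actual benchmark results.

```python
# Sketch of the Slice 1 "win count" metric: for each task, find which model
# scored highest, then count how many tasks each model wins.
from collections import Counter

# scores[task][model] = metric value on that task's held-out set (placeholders)
scores = {
    "task_a": {"solar-1-mini": 0.91, "phi-3": 0.88, "llama-3-8b": 0.86},
    "task_b": {"solar-1-mini": 0.74, "phi-3": 0.79, "llama-3-8b": 0.71},
    "task_c": {"solar-1-mini": 0.83, "phi-3": 0.80, "llama-3-8b": 0.82},
}

wins = Counter(max(models, key=models.get) for models in scores.values())
for model, count in wins.most_common():
    print(f"{model}: best on {count}/{len(scores)} tasks")
```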

We found that Solar fine-tunes achieved the highest score on 16 of the 31 tasks (approximately 51.6%). Following closely, Phi-3, Llama-3-8B, Zephyr-7b, and GPT-4 (base) tied for second place, each winning 3 of the 31 tasks (approximately 9.7%).

Upstage + Predibase Slice 1

Slice 2: Head to Head Performance of Fine-Tuned Solar

Slice 2 provides insights into the competitive performance of Solar fine-tunes by quantifying the frequency with which they achieve superior results compared to fine-tunes from other base models. Each percentage value indicates the proportion of tasks where Solar fine-tunes prevail over the competing model.

Upstage + Predibase Slice 2

For instance, a win rate of 83.87% against Phi-2 signifies that Solar fine-tunes outperformed Phi-2 on approximately 83.87% (26/31) of the tasks. Interestingly, Zephyr-7b-beta gives solar-1-mini-chat-240520 the closest competition, while solar-1-mini-chat-240520 fine-tunes almost always beat base GPT-3.5 Turbo.
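
Computing the head-to-head win rate is an equally simple pairwise comparison over the same per-task scores; again, the numbers below are placeholders, not the real results.

```python
# Sketch of the Slice 2 head-to-head win rate: the fraction of tasks on which
# one model's fine-tune scores higher than another's. Scores are placeholders.
def win_rate(scores: dict, model_a: str, model_b: str) -> float:
    """Share of tasks where model_a's score beats model_b's."""
    wins = sum(1 for task in scores.values() if task[model_a] > task[model_b])
    return wins / len(scores)

scores = {
    "task_a": {"solar-1-mini": 0.91, "zephyr-7b-beta": 0.88},
    "task_b": {"solar-1-mini": 0.74, "zephyr-7b-beta": 0.79},
    "task_c": {"solar-1-mini": 0.83, "zephyr-7b-beta": 0.80},
}

rate = win_rate(scores, "solar-1-mini", "zephyr-7b-beta")
print(f"solar-1-mini wins on {rate:.1%} of tasks")
```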

For more thorough insights on experimentation setup and results, you can check out this paper and Predibase’s fine-tuning index.

Best In Class Inference Cost Efficiency

With LoRAX, Predibase’s framework for dynamically serving hundreds of fine-tuned LoRA models on a single GPU, teams can serve all 31 of the fine-tuned solar-1-mini-chat-240520 LoRA adapters for the cost of a single dedicated LLM deployment.

The deployment can run on hardware as small as an A10G with 24GB of VRAM for cost savings; to handle very high request volumes (measured in queries per second) and get roughly 2x faster throughput, it can be scaled up to an A100.
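
To see why a single 24GB A10G can be enough, here is a rough back-of-the-envelope sketch, assuming an ~11B-parameter base model served in 16-bit precision plus small per-adapter LoRA weights; the figures are illustrative estimates only, since KV cache and activation memory also claim VRAM at serving time, and LoRAX can offload idle adapters.

```python
# Back-of-the-envelope VRAM estimate, assuming an ~11B-parameter base model in
# 16-bit precision plus lightweight LoRA adapters. Real memory use also includes
# KV cache and activations (which grow with batch size and sequence length), so
# treat this as a rough sanity check rather than a sizing guide.
PARAMS = 11e9          # assumed parameter count (~11B weight class)
BYTES_PER_PARAM = 2    # fp16/bf16 weights
ADAPTER_MB = 100       # rough size of a single LoRA adapter (rank-dependent)

base_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"Base model weights: ~{base_gb:.0f} GB")           # ~22 GB -> fits a 24 GB A10G
print(f"One LoRA adapter:   ~{ADAPTER_MB / 1e3:.1f} GB")  # adapters add very little on top
```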

To get a feel for response latency using LoRAX, check out LoRA Land, a demonstration of fine-tuned models and LoRAX.

Predibase also supports speculative decoding via Medusa at training time, which leads to ~3x faster inference throughput for fine-tuned models with no degradation in fine-tuned model performance.

What's next?

Want to try out fine-tuning Solar LLM on Predibase? Sign up for your free trial today!

Join us on July 11th for a webinar where the Predibase and Upstage teams will dive into the full story behind Solar LLM and showcase the performance results in more detail.
