LoRA Land: Fine-Tuned Open-Source LLMs that Outperform GPT-4

February 20, 2024 · 5 min read
LoRA Land Blog 2
Timothy Wang
Timothy Wang
Justin Zhao
Justin Zhao
Will Van Eaton
Will Van Eaton

TL;DR: We’re excited to release LoRA Land, a collection of 25 fine-tuned Mistral-7b models that consistently outperform base models by 70% and GPT-4 by 4-15%, depending on the task. LoRA Land’s 25 task-specialized large language models (LLMs) were all fine-tuned with Predibase for less than $8.00 each on average and are all served from a single A100 GPU using LoRAX, our open source framework that allows users to serve hundreds of adapter-based fine-tuned models on a single GPU. This collection of specialized fine-tuned models–all trained with the same base model–offers a blueprint for teams seeking to efficiently and cost-effectively deploy highly performant AI systems.

Join our webinar on February 29th to learn more!

Fine-Tuned Mistral-7b vs. GPT-4 2

LLM Benchmarks: 25 fine-tuned Mistral-7b adapters that outperform GPT-4.

The Need for Efficient Fine-Tuning and Serving

With the continuous growth in the number of parameters of transformer-based pretrained language models (PLMs) and the emergence of large language models (LLMs) with billions of parameters, it has become increasingly challenging to adapt them to specific downstream tasks, especially in environments with limited computational resources or budgets. Parameter Efficient Fine-Tuning (PEFT) and Quantized Low Rank Adaptation (QLoRA) offer an effective solution by reducing the number of fine-tuning parameters and memory usage while achieving comparable performance to full fine-tuning. 

Predibase has incorporated these best practices into its fine-tuning platform and, to demonstrate the accessibility and affordability of adapter-based fine-tuning of open-source LLMs, has fine-tuned 25 models for less than $8 each on average in terms of GPU costs.

Fine-tuned LLMs have historically also been very expensive to put into production and serve, requiring dedicated GPU resources for each fine-tuned model. For teams that plan on deploying multiple fine-tuned models to address a range of use cases, these GPU expenses can often be a bottleneck for innovation. LoRAX, the open-source platform for serving fine-tuned LLMs developed by Predibase, enables teams to deploy hundreds of fine-tuned LLMs for the cost of one from a single GPU. 

Serverless Fine-Tuned Deployment-2-13

Serverless Fine-tuned Endpoints: Cost-efficiently serve 100s of LLMs on single GPU.

By building LoRAX into the Predibase platform and serving many fine-tuned models from a single GPU, Predibase is able to offer customers Serverless Fine-tuned Endpoints , meaning users don’t need dedicated GPU resources for serving. This enables:

  • Significant cost savings: Only pay for what you use. No more paying for a dedicated GPU when you don’t need it.
  • Scalable infrastructure: LoRAX enables Predibase’s serving infrastructure to scale with your AI initiatives. Whether you’re testing one fine-tuned model or deploying one hundred in production, our infra meets you where you are.
  • Instant deployment and prompting: By not waiting for a cold GPU to spin up before prompting each fine-tuned adapter, you can test and iterate on your models much faster.

These underlying technologies and fine-tuning best practices built into Predibase significantly simplified the process of creating this collection of fine-tuned LLMs. As you’ll see, we were able to create these task-specific models that outperform GPT-4 with mostly out of the box training configurations.

Methodology

We first chose representative datasets, then we fine-tuned the models with Predibase and analyzed the results. We’ll cover each step in more detail.

Dataset Selection

We selected our datasets from ones that were widely available and either commonly used in benchmarking or as a proxy for industry tasks. Datasets that reflected common tasks in industry include content moderation (Jigsaw), SQL generation (WikiSQL), and sentiment analysis (SST2). We’ve also included evaluations on datasets that are more commonly evaluated in research, such as CoNLL++ for NER, QQP for question comparison, and many others. The tasks we fine-tuned span from classical text generation to more structured output and classification tasks.

Predibase-LoRA-Land-Datasets

Table of LoRA Land datasets spanning academic benchmarks, summarization, code generation, and more.

The length of the input texts varied substantially across tasks, ranging from relatively short texts to exceedingly long documents. Many datasets exhibited a long-tail distribution, where a small number of examples have significantly longer sequences than the average. To balance between accommodating longer sequences and maintaining computational efficiency, we opted to train models with a p95 percentile of the maximum text length.

We define fixed splits for reproducibility based on published train-validation-test splits, when available.

Training Configuration Template

Predibase is built on top of the open source Ludwig framework, which makes it possible to define fine-tuning tasks through a simple configuration YAML file. We generally left the default configuration in Predibase untouched, mostly focusing on editing the prompt template and the outputs. While Predibase allows users to manually specify various fine-tuning parameters, the default values have been refined by fine-tuning 100s of models to maximize performance on most tasks out of the box.

Here is an example of one of the configurations we used:

prompt:
  template: >-
    FILL IN YOUR PROMPT TEMPLATE HERE
output_feature: label_and_explanation
preprocessing:
  split:
    type: fixed
    column: split
adapter:
  type: lora
quantization:
  # Uses bitsandbytes.
  bits: 4
base_model: mistralai/Mistral-7b-v0.1

Simple YAML template for fine-tuning LoRA adapters for Mistral models.

Prompt Selection

When fine-tuning an LLM, users can define a prompt template to apply to each datapoint in the dataset to instruct the model about the specifics of the task. We deliberately chose prompt templates that would make the task fair for both our adapters and for instruction-tuned models like GPT-4. For example, instead of passing in a template that simply says “{text}”, the prompt includes detailed instructions about the task. Single-shot and few-shot examples are provided for tasks that require additional context like named entity recognition and data-to-text.

Evaluation Metrics

We created a parallelized evaluation harness that sends high volumes of queries to Predibase LoRAX-powered REST APIs, which also allows us to collect thousands of responses for OpenAI, Mistral, and fine-tuned models in a matter of seconds.

For evaluating model quality, we assess each adapter on fixed, 1000-sample subsets of held out test data excluded from training. We typically employ accuracy for classification tasks and ROUGE for tasks involving generation. However, in instances where there's a mismatch in the models' output types (for example, when our adapter produces an index while GPT-4 generates the actual class), we resort to designing custom metrics to facilitate a fair comparison of scores.

We plan to publish a comprehensive paper in the coming weeks, providing an in-depth explanation of our methodology and findings.

Results

Adapter name

Dataset

Metric

Fine-Tuned

GPT-4

GPT-3.5-turbo

Mistral-7b-instruct

Mistral-7b

Question Answering Explained (drop)

drop_explained

label_and_explanation

0.33

0.12

0.09

0

0

Named Entity Recognition (CoNLL++)

conllpp

rouge

0.99

0.75

0.81

0.65

0.12

Toxicity Detection (Jigsaw)

jigsaw

accuracy

0.84

0.76

0.74

0.52

0

News Topic Identification Explained (ag_news)

agnews_explained

label_and_explanation

0.45

0.19

0.23

0.25

0

Sentence Comparison (MRPC)

glue_mrpc

accuracy

0.86

0.78

0.68

0.65

0

Sentence Similarity (STSB)

glue_stsb

stsb

0.45

0.2

0.17

0

0

WebNLG (Web-based Natural Language Generation)*

webnlg

rouge

0.53

0.57

0.52

0.51

0.17

Question Comparison (QQP)

glue_qqp

accuracy

0.88

0.84

0.79

0.68

0

News Content Generation (tldr_news)

tldr_news

rouge

0.22

0.12

0.12

0.17

0.15

News Headline Generation (tldr_news)

tldr_news

rouge

0.43

0.18

0.17

0.17

0.1

Linguistic Acceptability (CoLA)

glue_cola

accuracy

0.87

0.87

0.84

0.39

0

Customer Support Automation

customer_support

accuracy

0.99

0.87

0.76

0

0

Open-Ended Sentence Completion (hellaswag)

hellaswag_processed

rouge

0.25

0.13

0.11

0.14

0.05

WikiSQL (SQL Generation)

wikisql

rouge

0.98

0.93

0.89

0.27

0.35

Sentiment Detection (SST2)

glue_sst2

accuracy

0.95

0.95

0.89

0.65

0

Question Answering (drop)

drop

rouge

0.76

0.42

0.11

0.11

0.02

News Summarization (cnn)

cnn

rouge

0.25

0.17

0.18

0.17

0.14

Grade School Math (gsm8k)

gsm8k

rouge

0.47

0.04

0.39

0.29

0.05

Structured-to-Text (viggo)

viggo

rouge

0.52

0.37

0.37

0.36

0.15

Structured-to-Text (e2e_nlg)

e2e_nlg

rouge

0.54

0.51

0.46

0.49

0.18

Chemical and Disease Recognition (bc5cdr)

bc5cdr

rouge

0.97

0.89

0.73

0.7

0.18

Multiple Choice Sentence Completion (hellaswag)

hellaswag

accuracy

0.82

0.8

0.51

0

0.03

Legal Clause Classification

legal

rouge

0.52

0.3

0.27

0.2

0.03

Sentiment Detection (COVID-19)

covid

accuracy

0.77

0.32

0.31

0

0

Natural Language Inference (MNLI)

glue_mnli

accuracy

0.87

0.81

0.51

0.27

0

News Topic Identification (dbpedia)

dbpedia

dbpedia

0.99

0.04

0.06

0.01

0

Code Generation (magicoder)*

magicoder

humaneval

0.11

0.82

0.49

0.05

0.01

ROUGE is a common metric for evaluating the efficacy of LLMs but is less representative of the model’s ability in this case. GSM8K would have required a different evaluation metric.

* We observed that when performing fine-tuning the Mistral-7B base model with our default configuration on two datasets (MagicCoder and WebNLG) the performance was significantly lower compared to GPT-4. We are continuing to investigate these results further and details will be shared in our upcoming paper.

Of the 27 adapters we provided, 25 match or surpass GPT-4 in performance. In particular, we found that adapters trained on language-based tasks, rather than STEM-based ones, tended to have higher performance gaps over GPT-4.

Try the Fine-Tuned Models Yourself

You can query and try out all the fine-tuned models in the LoRA Land UI. We also uploaded all the fine-tuned models on HuggingFace so that you can easily download and play around with them. If you want to try querying one of our fine-tuned adapters from HuggingFace, you can run the following curl request:

```
curl -d '{"inputs": "What is your name?", "parameters": {"adapter_id": "my_organization/my_adapter", "adapter_source": "hub"}}' \
    -H "Content-Type: application/json" \
    -X POST https://serving.app.predibase.com/$PREDIBASE_TENANT_ID/deployments/v2/llms/llama-2-7b-chat/generate \
    -H "Authorization: Bearer ${API_TOKEN}"
```

The Rise of Small, Task-Specific Models

While organizations have been racing to productionize LLMs ever since the debut of ChatGPT in late 2022, many chose to avoid the pitfalls of commercial LLMs in favor of trying to build their own. Leading commercial LLMs can help teams develop proofs of concept and get started generating quality results without any training data but come with trade offs in terms of:

  1. Control: Giving up ownership of model weights
  2. Cost: Paying an unsustainable premium for inference
  3. Reliability: Struggling to maintain performance on a constantly changing model

To overcome these challenges, many teams turned to either building their own Generative Pre-training Transformer (GPT) from scratch or to fine-tuning an entire model with billions or trillions of parameters. These approaches, however, are prohibitively expensive and require enormous computational resources that are out of reach for most organizations.

This in turn has driven the adoption of smaller, specialized LLMs and efficient approaches like the above-mentioned PEFT and QLoRA. These techniques reduce the training requirements–in terms of compute, time, and data–by orders of magnitude and make developing fine-tuned LLMs like those hosted in LoRA Land accessible even for teams with limited resources and smaller budgets. 

We built LoRA Land to provide a real world example of how smaller, task-specific fine-tuned models can cost-effectively outperform leading commercial alternatives. Predibase makes it much faster and far more resource efficient for organizations to fine-tune and serve their own LLMs, and we're happy to provide the tools and infrastructure for teams that want to start deploying specialized LLMs to power their business.

Related Articles

Join Our Community!