How to Fine-tune LLaMa-2 on Your Data with Scalable LLM Infrastructure

July 20, 2023 · 6 min read
Arnav Garg

Building on the momentum of the open-source LLaMA, Meta recently released LLaMA-2, a set of improved large language models with 7B, 13B, and 70B parameter variants reported to be on par with ChatGPT on many benchmarks. And while LLaMA-2 is better in terms of quality than LLaMA, has double the context length, and is licensed for commercial use, it typically won’t work well out of the box for your specific ML task, since it was trained on general text data from the web during the pretraining stage.

If you want to use it for your use case, you will likely want to fine-tune the model on your data. However, fine-tuning comes with its own set of challenges preventing most engineers and data scientists from being successful.

Key challenges that teams face when fine-tuning LLMs:

1. Writing code for fine-tuning: The learning process for fine-tuning LLMs is different from traditional fine-tuning of text-to-text models. Even though libraries like Hugging Face’s transformers and trl support fine-tuning LLMs, it still involves writing code by hand that is challenging to get right for enterprise fine-tuning tasks, for these reasons:

  • The lack of a unified interface for configuring the prompt and task descriptions and updating your dataset to correctly reflect these parameters.
  • The number of training hyperparameters you need to manually configure for your specific dataset.
  • The need to manage and scale out your own infrastructure for fine-tuning models in a distributed fashion, which is non-trivial. Typically, you will see good fine-tuning performance on your task with a model that has at least 7B parameters. However, you cannot fit a 7B parameter model on a single GPU with 24 GB of memory (7B parameters × 4 bytes ≈ 28 GB) unless you load it in half precision. Beyond loading the model onto the GPU, you typically need at least double that amount of GPU memory for your activations and optimizer state to train the model, even with a batch size of 1 (see the back-of-the-envelope sketch after this list). Solving all of these challenges effectively requires fairly deep knowledge of distributed training.

2. Acquiring Computational Resources: Large language models are resource-intensive and require substantial computational power, memory, and time for fine-tuning. Not every organization or individual has access to such resources, which can limit the widespread adoption of fine-tuning.

3. Managing Distributed Training: Most large language models can’t fit on a single GPU unless you’re using high-memory GPU instances like A100s. This means you have to move from regular data-parallel training to model-parallel or pipeline-parallel training, where the model weights are sharded across GPUs. This is typically done with open-source tools like DeepSpeed, but correctly configuring the many DeepSpeed parameters for your specific model choice can be fairly challenging: there is a high chance you will run into CPU or GPU out-of-memory issues, or you will end up significantly underutilizing your GPUs because of unnecessary offloading, making the training process very expensive.
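
To make the memory arithmetic above concrete, here is a back-of-the-envelope estimator. This is only a rough sketch: it uses the common rule of thumb of ~16 bytes per parameter for mixed-precision training with Adam, and it ignores activation memory, which grows with batch size and sequence length.

GB = 1e9

def full_finetune_memory_gb(num_params: float) -> float:
    """Rough estimate for mixed-precision full fine-tuning with Adam."""
    weights = 2 * num_params   # fp16/bf16 weights
    grads = 2 * num_params     # fp16/bf16 gradients
    master = 4 * num_params    # fp32 master copy of the weights
    adam = 8 * num_params      # fp32 momentum and variance states
    return (weights + grads + master + adam) / GB

n = 7e9  # Llama-2-7b
print(f"fp32 weights alone: {n * 4 / GB:.0f} GB")                        # ~28 GB
print(f"fp16 weights alone: {n * 2 / GB:.0f} GB")                        # ~14 GB
print(f"full fine-tune with Adam: {full_finetune_memory_gb(n):.0f} GB")  # ~112 GB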

Predibase makes fine-tuning easy and solves infra headaches

We have architected Predibase to solve the challenges of fine-tuning in these ways:

  1. Abstracting away writing the code for fine-tuning using Ludwig, an open-source project that lets you define model training through configurations. This makes it easy to iterate on your fine-tuning journey, try different parameter-efficient strategies, experiment with different variants of the prompt you are using for fine-tuning, and tie in the ability to prompt and fine-tune all at the same time.
  2. Managing prompt template iterations: The choice of prompt can have a significant effect on how well the model can be trained. For example, when fine-tuning LLMs that were previously instruction-tuned like LLaMA-2-chat, training is more effective when using the same prompt template as the LLM was trained with. In Predibase, the prompt template, data string, and task are treated as separate concepts, so it's trivial to iterate on each independently without having to rewrite your preprocessing code or manually maintain many different versions of your data.
  3. Automagically right-sizing compute resources to scale LLM fine-tuning jobs to your tasks, both in terms of the dataset and in terms of the model size, so that your fine-tuning task is always successful. This takes away the hassle of setting up your own infrastructure for training, figuring out how to perform distributed model training at scale, and running into a variety of challenges with memory-pressure issues on both CPU and GPU memory.
  4. Distributed training out-of-the-box: Predibase handles generating the appropriate DeepSpeed configuration for sharding the model across GPUs, enables half-precision training, offloads parameters and optimizer states where appropriate to maximize throughput and keep CPU and GPU utilization high, and a lot more. All of this is done behind the scenes so you don’t have to worry about configuring these parameters, and your training process runs reliably and scalably while keeping costs low. (For a sense of what such a configuration looks like when written by hand, see the sketch below.)
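
For reference, here is an illustrative hand-written DeepSpeed configuration of the kind a team would otherwise maintain themselves. The specific values are assumptions for a small setup, not the configuration Predibase actually generates.

import json

# Illustrative DeepSpeed ZeRO stage 3 config (example values only).
ds_config = {
    "bf16": {"enabled": True},                   # half-precision training
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, and optimizer state
        "offload_optimizer": {"device": "cpu"},  # spill optimizer state to CPU RAM
        "offload_param": {"device": "none"},     # keep params on GPU if they fit
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

Each of these knobs trades GPU memory against throughput, which is exactly why misconfiguring them leads to out-of-memory errors or underutilized GPUs.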

Three steps to fine-tune Llama-2

Predibase makes it very easy to fine-tune your large language model and customize it to your task. It just requires 3 steps:

  1. Connecting your dataset,
  2. Configuring training parameters such as the choice of LLM, prompt, learning rate, and batch size,
  3. Hitting the train button.

For this short tutorial, we will fine-tune LLaMA-2-7b on the Alpaca dataset using LoRA for parameter-efficient fine-tuning.

1. Connect Your Dataset

The first step in Predibase is to connect your data. Download the Alpaca dataset from here. Once you have the data downloaded, you can transform the raw data using the Python script below. The goal is to merge the input column into the instruction column so that we have one input text feature and one output text feature in our dataset for fine-tuning.

import pandas as pd

# Read in the data
df = pd.read_json('alpaca_data.json')

# Merge the instruction and input columns into one. Many Alpaca rows have an
# empty input, so only append the "### Input:" section when input is non-empty.
df['instruction'] = df.apply(
    lambda row: row['instruction'] + (' ### Input: ' + row['input'] if row['input'] else ''),
    axis=1,
)

# Drop the input column since it's now merged into the instruction column
df.drop(['input'], axis=1, inplace=True)

# Write the data to a csv file
df.to_csv('./alpaca_data_cleaned.csv', index=False)

Finally, you can upload your dataset to Predibase. I uploaded my CSV file to S3 and imported it into Predibase. 

To get started, install the Predibase SDK using the following command:

pip install -U predibase

Next, use your Predibase API token to initialize the Predibase client:

from predibase import PredibaseClient
pc = PredibaseClient(token="{your_api_token}")

Set up your S3 credentials using the Predibase client and import your dataset into Predibase:

connection = pc.create_connection_s3('S3 Connection', '{AWS Access Key ID}', '{AWS Secret Access Key}')
connection = pc.get_connection('S3 Connection')
dataset = connection.import_dataset("s3://{Bucket Name}/{Path to Object}", "Alpaca")

Once your data is connected, the Predibase app provides an informative dataset previewer that shows statistics about your dataset along with a small sample. What is particularly useful here is that the mean length of the instruction column is 16 words with a max of 266 words, while the output column has a mean of 44 words and a max of 491 words. These statistics help you decide how many tokens to keep during fine-tuning, since we want to train over as many complete instruction-output pairs as possible. (A quick way to reproduce similar statistics locally is sketched after the screenshot below.)

The Predibase app provides useful statistics about your dataset.
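
If you prefer to compute these statistics yourself before uploading, a quick pandas sketch follows. Note that these are word counts, which only approximate token counts; for typical English text, the tokenizer produces somewhat more tokens than words.

import pandas as pd

df = pd.read_csv('alpaca_data_cleaned.csv')

# Approximate sequence lengths in words; token counts will run somewhat higher.
for col in ['instruction', 'output']:
    lengths = df[col].astype(str).str.split().str.len()
    print(f"{col}: mean={lengths.mean():.0f} words, max={lengths.max()} words")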

2. Choose the LLM and training parameters

You can use the following script to kick-start fine-tuning in Predibase.

import yaml

config = yaml.safe_load("""
model_type: llm
base_model: meta-llama/Llama-2-7b-hf
input_features:
  - name: instruction
    type: text
    preprocessing:
      max_sequence_length: 256
output_features:
  - name: output
    type: text
    preprocessing:
      max_sequence_length: 512
prompt:
  task: Write a response that appropriately completes the request.
  template: |-
    Below is an instruction that describes a task. {task}
    ### Instruction: {sample_input}
    ### Response:
adapter:
  type: lora
  r: 8
  alpha: 16
  dropout: 0.05
trainer:
  type: finetune
  epochs: 3
  optimizer:
    type: adamw
    weight_decay: 0
  batch_size: 1
  learning_rate: 0.00005
  eval_batch_size: 4
  steps_per_checkpoint: 200
  learning_rate_scaling: constant
  learning_rate_scheduler:
    warmup_fraction: 0.01
  gradient_accumulation_steps: 16
""")

engine = pc.get_engine(name="default_engine")

repo = pc.get_model_repo(name="Instruction Tuning LLaMA-2-7b on Alpaca using LoRA")
md = repo.head().to_draft()
md.config = config
md.dataset = pc.get_dataset(dataset_name='Alpaca', connection_name='S3 Connection')

result = md.train_async(engine=engine)
# Block until training completes; cancel the job if interrupted (e.g., Ctrl-C).
try:
    result.get()
except KeyboardInterrupt:
    result.cancel()

The first step is to define our Ludwig configuration for fine-tuning. Specifically, we:

  1. Specify the model type as llm and set the base_model name to meta-llama/Llama-2-7b-hf. This ensures that we are going to fine-tune Llama-2-7b.
  2. Define our input and output feature names based on the alpaca dataset we uploaded to Predibase, and set the max_sequence_length for each of them to restrict the number of tokens to train on.
  3. Configure your prompt to have a template and task. This is important because we want to instruction-tune Llama-2-7b to follow instructions, just like how ChatGPT responds to questions. Beneath the surface, this will wrap every row of your instruction input feature with this template and task. For example, if your instruction is:
Explain why the given definition is wrong. ### Input: A mole is an animal that lives underground.

Then, your prompt and task will convert it to:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: Explain why the given definition is wrong.
### Input: A mole is an animal that lives underground.
### Response: 

  4. Optionally pick an adapter, which enables parameter-efficient fine-tuning. In this case, we select the LoRA adapter and configure some associated parameters (a minimal sketch of the idea behind LoRA follows this list). This enables us to efficiently fine-tune a small set of weights for this specific task, without running into the issue of catastrophic forgetting.

  5. Configure your training parameters, such as batch size, learning rate, number of epochs, and more. Note that the effective batch size here is batch_size × gradient_accumulation_steps = 1 × 16 = 16 per GPU.
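
To build some intuition for what the lora adapter section in the config does, here is a minimal PyTorch sketch of the idea behind LoRA. This is illustrative only, not Ludwig’s actual implementation: each frozen pretrained weight matrix gets a trainable low-rank update B·A scaled by alpha / r.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-augmented linear layer (illustrative only)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weights
        self.lora_a = nn.Linear(base.in_features, r, bias=False)   # down-projection A
        self.lora_b = nn.Linear(r, base.out_features, bias=False)  # up-projection B
        nn.init.zeros_(self.lora_b.weight)  # the update starts as a no-op
        self.dropout = nn.Dropout(dropout)
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x, where only A and B are trained
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(self.dropout(x)))

With r=8, a 4096×4096 attention projection goes from ~16.8M frozen weights to just 2 × 8 × 4096 ≈ 65K trainable ones, which is why LoRA fine-tuning fits comfortably in GPU memory.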

Next, we use the Predibase Client to select the default engine (our term for compute) we want to use for training, create a new model repository to track our fine-tuning experiments, and start a new model draft. The model draft is assigned the Ludwig config we created in the previous step, as well as the ID of the dataset we want to train on. Finally, we start training our model when we call md.train_async().

That’s it. From here, Predibase allocates the necessary compute for your training job, tunes your batch size to maximize throughput so that your GPUs are saturated, uses DeepSpeed in the background to scale and distribute your training job over multiple GPUs, and produces training metrics for your training, validation, and test sets so you can follow the fine-tuning process.

Your learning curves will look like this when fine-tuning.

If you prefer using a UI for fine-tuning, Predibase also supports a model editor for configuring your parameters through the Predibase app:


Start Querying Your Customized LLaMA-2 with Predibase’s Free Trial

Interested in trying this out on your own? Sign up for a 14-day free trial of Predibase (no credit card required!). We offer two free trial options based on your interest:

  1. If you just want to query LLaMA-2 on your own data, then get going instantly with our fully-hosted Predibase Cloud trial.
  2. If you want to test drive the full experience of fine-tuning LLaMA-2, then contact us to upgrade to our premium VPC trial for free.
