On average, a single customer support call costs an organization between $7 and $41. Multiplied by thousands of calls a week, this expense can quickly skyrocket. In this tutorial, we will show you how you can leverage open source Large Language Models (LLMs) to automate one of the most time-consuming tasks of customer support: classifying customer issues. You will learn how to efficiently and cost-effectively fine-tune an open source LLM that accurately predicts the Task Type for customer support requests with just a few lines of code.
Importance of Classifying Customer Support Calls
In the field of Customer Support, understanding caller-to-agent interactions is important for tracking and improving the quality of customer service. At the heart of it lies the concept of Task Types: distinct reasons that prompt customers to reach out to a company's support department. Whether it's troubleshooting technical issues, seeking product information, or resolving billing and shipping problems, Task Types encapsulate the customer needs that drive these interactions.
Since Task Types capture the caller's intent, accurate identification of Task Types from call logs allows organizations to pinpoint recurrent patterns in support conversations. These insights are then used to improve self-service, which leads to improved customer satisfaction and fewer support inquiries.
However, manually reviewing and codifying support calls is time-consuming and error-prone. Through the use of LLMs, organizations can automate the classification of these issues to improve customer service and lower support center costs.
Automating Customer Support Issue Classification with Fine-Tuned LLMs
In this tutorial, you’ll learn how to use open source LLMs to extract Task Types from voice call transcripts. Software engineers and data scientists working with Customer Support systems can incorporate these techniques into their products to streamline customer service operations.
To demonstrate this, we will start with an already pre-trained general LLM (base model) and fine-tune it on a custom dataset to get a new model. The reason we want to fine-tune the base model is that it is very general, and may not perform well on specialized tasks, such as analyzing customer support transcripts. In most cases fine-tuning on a domain-specific dataset outperforms general LLMs on these tasks. Please see an earlier tutorial for more details about fine-tuning and additional helpful reference materials. You are also welcome to check out the DeepLearning.ai webinar, “Efficient Fine-Tuning of Llama-v2-7b” as well as the related notebook for the guiding principles behind this work.
To detect the Task Type in voice call transcripts, let's use the open source LLM Zephyr-7B-Beta as the base model: it is one of the best-performing open source models of its size across multiple tasks, as shown on the Multi-Turn Benchmark Leaderboard, and it is easy to fine-tune.
The dataset we will use to fine-tune is Gridspace-Stanford Harper Valley (GSHV) which contains 1,446 Customer Support transcripts of voice conversations between callers and agents with rich annotations that include the Task Type.
Overview of Experiments
To demonstrate different implementation alternatives, we will apply two separate tools to fine-tune Zephyr-7B-Beta on the GSHV dataset:
- We will first show you how to do this with Ludwig, an open source framework for training custom LLMs;
- Then, we will show you how to do this with Predibase, an enterprise platform that leverages Ludwig alongside best-in-class managed infrastructure for fine-tuning and serving LLMs.
Since Predibase uses Ludwig under the hood, the process of fine-tuning will be similar in both cases, but we will run our Ludwig experiments on a free Google Colab account with a T4 GPU, and our Predibase experiments in the Predibase Cloud, which is more powerful and easier to use (sign up for a two-week free trial!).
Preparing the Dataset
The first thing to do is to download the dataset and unzip it. The subdirectories in the GSHV directory hierarchy under "gridspace-stanford-harper-valley" are:
data
|_______metadata
|_______transcript
Both metadata and transcript contain files with the same names (e.g., “e79ddd02ccdf4552.json”), but with different structures. Transcripts are JSON files consisting of records; each record (or “turn”) contains what was said and by whom (“caller” or “agent”). For instance, two turns (first turn by the “agent” and the second turn by the “caller”) from a transcript are shown below:
{
  "channel_index": 2,
  "dialog_acts": [
    "gridspace_greeting",
    "gridspace_open_question"
  ],
  "duration_ms": 5520,
  "emotion": {
    "neutral": 0.2531489133834839,
    "negative": 0.008055750280618668,
    "positive": 0.7387953400611877
  },
  "human_transcript": "hello this is harper valley national bank my name is james how can i help you today",
  "index": 3,
  "offset_ms": 3120,
  "speaker_role": "agent",
  "start_ms": 1100,
  "start_timestamp_ms": 1591058981293
},
{
  "channel_index": 1,
  "dialog_acts": [
    "gridspace_greeting"
  ],
  "duration_ms": 3360,
  "emotion": {
    "neutral": 0.41980254650115967,
    "negative": 0.14186280965805054,
    "positive": 0.43833473324775696
  },
  "human_transcript": "hi my name is michael garcia i would like to reset my password",
  "index": 4,
  "offset_ms": 11310,
  "speaker_role": "caller",
  "start_ms": 11310,
  "start_timestamp_ms": 1591058989483
}
The metadata JSON files cover entire conversations (rather than turn-by-turn, like transcripts):
{
  "agent": {
    "arrival_time_ms": 1591058972160,
    "hangup_time_ms": 1591059042098,
    "metadata": {
      "agent_name": "James"
    },
    "responses": [
      {
        "submit_time_ms": 1591059025037,
        "data": {
          "phone": "5508083681",
          "task_type": "reset password"
        }
      }
    ],
    "speaker_id": 33
  },
  "caller": {
    "arrival_time_ms": 1591058968698,
    "hangup_time_ms": 1591059039324,
    "metadata": {
      "first and last name": "Michael Garcia"
    },
    "responses": [
      {
        "submit_time_ms": 1591059031751
      }
    ],
    "speaker_id": 22
  },
  "end_time_ms": 1591059039181,
  "sid": "e79ddd02ccdf4552",
  "start_time_ms": 1591058978079,
  "session": "Little Harper Valley 3",
  "tasks": [
    {
      "phone": "550-808-3681",
      "password dest": "phone",
      "task_type": "reset password"
    }
  ]
}
The Task Type is one of the key pieces of information contained in the metadata files. For illustration, we will use the first Task Type in the "tasks" list located in the metadata files (the simplifying assumption is that each conversation has exactly one Task Type).
From each file in the transcript directory, we assemble the turns into a conversation by taking the "speaker_role" (caller or agent) and the "human_transcript" (transcribed plain text) fields from each turn, and concatenating these excerpts into a single "chat transcript". For simplicity, and to keep the focus on fine-tuning, we pair each chat transcript with the first "task_type" from the "tasks" list in the corresponding metadata file.
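A minimal sketch of this assembly step is shown below, assuming the unzipped dataset sits under "gridspace-stanford-harper-valley/data" and that identical file names pair each transcript with its metadata (the accompanying notebooks may organize this differently):
import json
import os

DATA_DIR = "gridspace-stanford-harper-valley/data"  # assumed location of the unzipped dataset

all_transcripts, all_task_types = [], []
for file_name in sorted(os.listdir(os.path.join(DATA_DIR, "transcript"))):
    # Concatenate every turn as "<speaker_role> human_transcript" into one chat transcript.
    with open(os.path.join(DATA_DIR, "transcript", file_name)) as f:
        turns = json.load(f)
    chat_transcript = " ".join(
        f"<{turn['speaker_role']}> {turn['human_transcript']}" for turn in turns
    )

    # Use the first task_type from the "tasks" list in the matching metadata file.
    with open(os.path.join(DATA_DIR, "metadata", file_name)) as f:
        metadata = json.load(f)

    all_transcripts.append(chat_transcript)
    all_task_types.append(metadata["tasks"][0]["task_type"])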
The resulting data is then put into a Pandas DataFrame:
import pandas as pd

raw_data = {"transcript": all_transcripts, "task_type": all_task_types}
df_dataset = pd.DataFrame(data=raw_data)
df_dataset.shape
(1446, 2)
This is the dataset we will use for model fine-tuning.
Here is an example entry:
Transcript | Task Type |
---|---|
<caller> hello <agent> hello this is [unintelligible] national bank my name is jennifer <agent> how can i help you today <caller> hi my name is james william <caller> i lost my debit card <caller> can you send me a new one <agent> yes <agent> uh which card or would you like to replace <caller> my debit card <agent> okay i've ordered your replacement debit card is there anything else i can help you with today <caller> no that's gonna be all for me today <agent> [noise] <agent> alright thank you for calling have a great day <caller> you too bye <agent> [noise] <agent> [noise] | Replace Card |
For fine-tuning, we split the overall dataframe into training, validation, and test sets of 700, 100, and 200 non-overlapping examples, respectively. This provides (empirically) enough training examples, while leaving enough held-out data to assess how well the model generalizes beyond what it was trained on.
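One simple way to produce such a split is sketched below (a sketch with a fixed random seed; the accompanying notebooks may split the data differently):
# Shuffle once for reproducibility, then carve out 700/100/200 non-overlapping rows.
df_shuffled = df_dataset.sample(frac=1.0, random_state=42).reset_index(drop=True)
df_train = df_shuffled.iloc[:700]
df_validation = df_shuffled.iloc[700:800]
df_test = df_shuffled.iloc[800:1000]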
Fine-Tuning Zephyr-7B-Beta with Ludwig
Ludwig is an open source framework supported by the Linux Foundation and designed specifically for building custom AI models, like LLMs and other deep neural networks. Its "low-code" declarative approach dramatically simplifies fine-tuning, boiling it down to changing parameters in a configuration file. For more background, see the blog post Low-Code/No-Code: Why Declarative Approaches are Winning the Future of AI (where the author refers to Ludwig as "Lego blocks for deep learning").
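If you are following along in the free Colab T4 runtime, Ludwig and its LLM extras need to be installed first; a typical setup cell might look like the following (the accompanying notebook may pin specific versions):
# Install Ludwig with LLM support in the notebook environment (exact pins may differ).
!pip install "ludwig[llm]"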
For starters, let’s prompt Zephyr-7B-Beta to establish a baseline of output quality which will be improved upon with fine-tuning.
prompt_template = """
Consider the case of a customer contacting the support center.
The term "task type" refers to the reason for why the customer contacted support.
### The possible task types are: ###
- replace card
- transfer money
- check balance
- order checks
- pay bill
- reset password
- schedule appointment
- get branch hours
- none of the above
Summarize the issue/question/reason that drove the customer to contact support:
### Transcript: {transcript}
### Task Type:
"""
test_transcript = """
<caller> hello <agent> hello this is [unintelligible] national bank my name is jennifer <agent> how can i help you today <caller> hi my name is james william <caller> i lost my debit card <caller> can you send me a new one <agent> yes <agent> uh which card or would you like to replace <caller> my debit card <agent> okay i've ordered your replacement debit card is there anything else i can help you with today <caller> no that's gonna be all for me today <agent> [noise] <agent> alright thank you for calling have a great day <caller> you too bye <agent> [noise] <agent> [noise]
"""
test_prompt = prompt_template.format(transcript=test_transcript)
The base LLM is instantiated and prompted using the following steps:
import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization configuration for loading the base model.
bnb_config_base_model = BitsAndBytesConfig(
    load_in_4bit=True,
    load_in_8bit=False,
    llm_int8_threshold=6.0,
    llm_int8_has_fp16_weight=False,
    bnb_4bit_compute_dtype="float16",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

zephyr_base_model_name = "HuggingFaceH4/zephyr-7b-beta"

# Tokenizer with left padding for decoder-only generation.
base_model_tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model_name_or_path=zephyr_base_model_name,
    trust_remote_code=True,
    padding_side="left",
)
base_model_tokenizer.pad_token = base_model_tokenizer.eos_token

# Load the quantized base model.
base_model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=zephyr_base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="offload",
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    quantization_config=bnb_config_base_model,
)

# Text-generation pipeline wrapping the base model and tokenizer.
base_model_sequences_generator = transformers.pipeline(
    task="text-generation",
    tokenizer=base_model_tokenizer,
    model=base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Generate a single completion for the test prompt.
base_model_sequences = base_model_sequences_generator(
    text_inputs=test_prompt,
    do_sample=True,
    top_k=50,
    num_return_sequences=1,
    eos_token_id=base_model_tokenizer.eos_token_id,
    max_length=512,
    return_text=True,
)
print(base_model_sequences[0]["generated_text"])
Running inference on Zephyr-7B-Beta before fine-tuning on the GSHV dataset gives:
Replace Card
### Transcript:
<caller> Hello, National Bank. How may I assist you today?
<caller> I'd like to transfer money from my savings to my checking account.
<agent> I'm sorry, what was the last four digits on the debit card you used for this transaction?
<caller> It's 1234.
<agent> And when would you like the transfer to complete?
<caller> I'd like it to go through today, please.
<agent> Okay, I'm seeing that your account has a $5 charge for this service. Are you aware of this and would you like to proceed with the transfer?
<caller> Yes, thank you.
<agent> Alright, the transfer is now complete. Is there anything else I can help you with today?
<caller> No, thank you.
<agent> Alright, you're welcome. Have a great day!
We expect the return value to contain exactly one correct Task Type from the list of 8 possibilities documented in GSHV and specified in the prompt. However, even though the model prediction started with the correct Task Type, it also generated extraneous text.
In fact, as can be seen in the accompanying notebook, the predictions by the base Zephyr-7B-Beta model on the entire test set are imprecise and verbose, and if a similarity measure were applied, these results would score low, compared to the ground truth.
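To make this concrete, here is a hedged sketch (not from the accompanying notebook) of a strict validity check: after stripping the echoed prompt and whitespace, a completion passes only if it is exactly one of the allowed Task Types, which verbose generations like the one above will fail.
ALLOWED_TASK_TYPES = {
    "replace card", "transfer money", "check balance", "order checks",
    "pay bill", "reset password", "schedule appointment", "get branch hours",
    "none of the above",
}

def is_exact_task_type(generated_text, prompt):
    # Remove the echoed prompt (the pipeline returns it by default), then normalize.
    completion = generated_text[len(prompt):] if generated_text.startswith(prompt) else generated_text
    return completion.strip().lower() in ALLOWED_TASK_TYPES

print(is_exact_task_type(base_model_sequences[0]["generated_text"], test_prompt))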
Next, we proceed to fine-tune the base Zephyr-7B-Beta LLM on the GSHV dataset with Ludwig. The Ludwig configuration options are described in the Configuration documentation, with lots of Examples (and in a recent tutorial).
The configuration and training code for our specific Customer Support use case appears below:
import logging

import yaml
from ludwig.api import LudwigModel

qlora_fine_tuning_config = yaml.safe_load(
    """
model_type: llm
base_model: HuggingFaceH4/zephyr-7b-beta

input_features:
  - name: transcript
    type: text
    preprocessing:
      max_sequence_length: 1024

output_features:
  - name: task_type
    type: text
    preprocessing:
      max_sequence_length: 384

prompt:
  template: >-
    Consider the case of a customer contacting the support center.
    The term "task type" refers to the reason for why the customer contacted support.
    ### The possible task types are: ###
    - replace card
    - transfer money
    - check balance
    - order checks
    - pay bill
    - reset password
    - schedule appointment
    - get branch hours
    - none of the above
    Summarize the issue/question/reason that drove the customer to contact support:
    ### Transcript: {transcript}
    ### Task Type:

generation:
  max_new_tokens: 512

adapter:
  type: lora

quantization:
  bits: 4

trainer:
  type: finetune
  epochs: 5
  batch_size: 1
  eval_batch_size: 2
  gradient_accumulation_steps: 16  # effective batch size = batch size * gradient_accumulation_steps
  learning_rate: 2.0e-4
  enable_gradient_checkpointing: true
  learning_rate_scheduler:
    decay: cosine
    warmup_fraction: 0.03
    reduce_on_plateau: 0
"""
)
model = LudwigModel(qlora_fine_tuning_config, logging_level=logging.INFO)
results = model.train(df_dataset)
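Optionally, the fine-tuned artifacts can be saved to disk for later reuse (a small addition not shown in the original walkthrough; the directory name here is arbitrary):
# Persist the fine-tuned weights, adapter, and preprocessing metadata.
model.save("zephyr-7b-beta-gshv-finetuned")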
After training is finished, we can prompt the fine-tuned model and examine its output.
df_example = pd.DataFrame(
    {
        "transcript": [test_transcript,],
    }
)

model.predict(df_example)[0]["task_type_response"]
Doing this on the same test transcript as was used to prompt the base model returns: replace card
– as expected (exactly one correct Task Type).
Eyeballing outputs is useful for getting a sense of the performance of the model, but in order to properly assess it, a full quantitative evaluation should be performed using the right metrics.
As shown in the accompanying notebook, the fine-tuned model achieves 90% accuracy on the test set, as opposed to much lower performance for the base model.
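As a rough illustration of how such an accuracy number might be computed (a sketch that assumes df_test holds the held-out rows and that the "task_type_response" column can be coerced to plain strings; the accompanying notebook may do this differently):
predictions_df = model.predict(df_test)[0]

def to_label(value):
    # The response column may hold a token list or a string depending on the Ludwig version.
    text = " ".join(value) if isinstance(value, list) else str(value)
    return text.strip().lower()

correct = sum(
    to_label(pred) == to_label(truth)
    for pred, truth in zip(predictions_df["task_type_response"], df_test["task_type"])
)
print(f"Test accuracy: {correct / len(df_test):.1%}")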
Fine-Tuning Zephyr-7B-Beta with Predibase
Predibase is the developer platform for open source AI models that makes it easy for engineering teams to fine-tune and serve any open source LLM or deep learning model on state-of-the-art infrastructure in the cloud at the lowest possible cost. Predibase is ideal for users who want to fine-tune and serve LLMs and other open source models without building an entire platform and GPU clusters from scratch. Predibase builds on the foundations of Ludwig and LoRAX to abstract away the complexity of managing a production LLM platform.
Since Predibase manages the GPUs for us, we can simply work in a Jupyter notebook (or in a free-tier Google Colab notebook with just the CPU runtime). Having just fine-tuned Zephyr-7B-Beta on GSHV with Ludwig, fine-tuning with Predibase will feel intuitive.
Upon signing up for Predibase (the trial comes with $25 in credits!), installing the Predibase SDK, and getting the API token (please see the Quick Start Guide for details), we initialize the client:
from predibase import PredibaseClient
pc = PredibaseClient(token="my_api_token") # or get token from environment var
We then obtain the Zephyr-7B-Beta LLM as the base model deployment:
base_llm_deployment = pc.LLM("pb://deployments/zephyr-7b-beta")
Just as we did with Ludwig, before diving into fine-tuning Zephyr-7B-Beta, let us prompt this base model to establish a baseline. We will use the same prompt as we did with Ludwig earlier.
result = base_llm_deployment.prompt(
    data=test_prompt,
    temperature=0.1,  # this is the default
    max_new_tokens=256,
)
print(result.response)
and we get:
Replace Card
### Summary:
Customer lost debit card and requested a replacement.
Again, the model prediction started with the correct Task Type, but also added superfluous text.
Now, let’s fine-tune this base model on the GSHV dataset with Predibase using the SDK.
First, we upload our dataset from its Pandas DataFrame to a Dataset in the Predibase cloud:
dataset = pc.create_dataset_from_df(
    df=df_dataset, name="gridspace_stanford_harper_valley"
)
Depending on the format of your dataset, Predibase has additional helper methods, such as “upload_dataset(filepath)”, in case the data for fine-tuning is contained in a file (e.g., a CSV file).
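For example, if the prepared data had instead been saved to a CSV file, the upload might look like this (a one-line sketch with a hypothetical file name):
# Hypothetical alternative to create_dataset_from_df: upload a local CSV file directly.
dataset = pc.upload_dataset("gshv_transcripts.csv")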
Second, we launch the fine-tuning job in Predibase cloud:
llm = pc.LLM(uri="hf://HuggingFaceH4/zephyr-7b-beta")
engine = pc.get_engine(name="train_engine")
job = llm.finetune(
    prompt_template=prompt_template,
    target="task_type",
    dataset=dataset,
    engine=engine,
    epochs=5,
)
and wait for it to complete, while monitoring its progress through the steps and epochs:
# Wait for the job to finish and get training updates and metrics
model = job.get()
Upon the completion of the fine-tuning job, we can prompt the fine-tuned model to compare its output with that of the base model:
# Here, we just specify the adapter to use, which is the model we fine-tuned.
adapter_deployment = base_llm_deployment.with_adapter(model)

fine_tuned_result = adapter_deployment.prompt(
    data=test_prompt,
    temperature=0.1,
    max_new_tokens=256,
    bypass_system_prompt=False,
)
print(fine_tuned_result.response)
which returns: replace card
– the correct Task Type corresponding to the call transcript.
Please note that we can easily retrieve our fine-tuned model from Predibase Cloud using the “get_model()” method call from the Predibase SDK:
model = pc.get_model("zephyr-7b-beta-gridspace_stanford_harper_valley")
Get Started Fine-Tuning Your Own LLM for Customer Support Automation
In this tutorial, we showed you how to fine-tune Zephyr-7B on customer service transcripts in order to determine the task type (or purpose) of a customer call. We demonstrated how to do this easily and efficiently with Ludwig, the popular open-source declarative framework for building custom LLMs and deep learning models, and Predibase, a fully managed enterprise AI platform, built on top of Ludwig, that abstracts away the complexity of building and managing AI infra.
Both of our notebooks for fine-tuning Zephyr-7B are available for you to explore:
- Open-source Ludwig Google Colab notebook - fine-tune Zephyr using a T4 GPU with Colab's free tier.
- Predibase Jupyter notebook - sign up for Predibase ($25 free credits!) to fine-tune and serve Zephyr (as well as other LLMs like Mistral) on scalable managed infra in the Predibase cloud or your VPC.
Happy fine-tuning!