How Checkr Streamlines Background Checks with Fine-tuned Small Language Models on Predibase

October 3, 2024 · 5 min read
Vlad Bukhin, Staff ML Engineer at Checkr

Checkr is a technology company that specializes in delivering modern and compliant background checks for over 100,000 customers of all sizes. Since its founding in 2014, Checkr has leveraged AI and machine learning (ML) to enhance the efficiency, inclusivity, and transparency of the background check process.

I recently had the pleasure of speaking at the LLMOps Summit in San Francisco, where I got to share my experiences at Checkr building a production-grade LLM classification system that automates the annotation of background checks so that customers can adjudicate easily. By using Predibase to fine-tune a small open-source language model (SLM), we achieved better accuracy, faster response times, and a 5X reduction in costs compared to traditional GenAI approaches that rely on commercial LLMs. In this blog, I'll walk you through our journey at Checkr and how we made it happen.

Use case: Automating Adjudication with LLMs

Goal: Classify the most complex records with good accuracy

Adjudication is the process of reviewing background check results to determine a candidate's suitability for hiring based on a company's policies. Checkr provides an automated adjudication solution that has been shown to reduce manual reviews by 95%. We use data collected from various vendors to streamline hiring while creating a process that is unbiased, transparent, and compliant. On average, we conduct millions of background checks each month, and 98% of the data is processed efficiently using a tuned logistic regression model. 

The remaining 2%, however, are complex cases that require categorizing data records into 230 distinct categories. Our original solution for this 2% of data involved a Deep Neural Network (DNN) that could classify only 1% of total volume with decent accuracy, leaving the other 1% unclassified and creating more work for the customer. We needed a system with better performance, so we decided to embark on an exploration of different techniques using LLMs.

Checkr Data

The data we utilize for adjudication is messy and complex, requiring heavy processing and classification.

Challenge: Deliver high accuracy, low latency, and cost efficiency 

These complex adjudication cases require careful handling and pose the following challenges and requirements:

  • Complex Data: The remaining 2% of background checks involve noisy, unstructured text data that is challenging for both human reviewers and automated models to process. Manual reviews take hours.
  • Synchronous Task: This task is synchronous and must meet low-latency SLAs in order to provide our customers with near real-time employment insights.
  • High Accuracy: Our reports are used to make important decisions about a prospective employee's future. Therefore, it’s critical that our models are highly accurate.
  • Reasonable Inference Costs: Achieving our goals must be balanced with maintaining reasonable inference costs.
Checkr traffic patterns

We process millions of tokens a month and need a system that can efficiently scale.

Experiments and Results: From GPT-4 to Fine-tuning Small Language Models (SLMs)

Experimenting with different LLM patterns

In our experiments, I explored several approaches to improve classification accuracy with LLMs. The first experiment involved using GPT-4 as a general-purpose "Expert LLM" on the 98% of easier classification cases. The model performed well, achieving 87-88% accuracy. However, it achieved only 80-82% accuracy on the hardest 2% of classification cases.

When integrating RAG with our Expert LLM, I was able to achieve an impressive 96% accuracy on the easier dataset, which was well represented in the training set. However, accuracy decreased on the more difficult dataset because the retrieved training examples often led the LLM away from the better logical conclusion.
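For readers curious what the Expert + RAG pattern looks like in practice, here is a minimal sketch of retrieving similar labeled charges from the training set and packing them into a few-shot prompt. It is illustrative only: the embedding model, the number of neighbors, the example charge strings and categories, and the prompt template are assumptions, not our production pipeline.

```python
# Illustrative RAG sketch: retrieve similar labeled examples from the training
# set and include them as few-shot context for the expert LLM.
# Embedding model, neighbor count, example data, and prompt template are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.neighbors import NearestNeighbors

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled training examples: (charge_text, category) -- made-up rows for illustration.
train_examples = [
    ("POSS CNTL SUB 3RD DEG", "drug_possession"),
    ("DWI .08 OR MORE", "dui"),
    # ... ~150k rows in practice
]
train_texts = [text for text, _ in train_examples]
train_embeddings = embedder.encode(train_texts, normalize_embeddings=True)

# 6 examples in the real prompt; capped here so the toy dataset still works.
index = NearestNeighbors(
    n_neighbors=min(6, len(train_texts)), metric="cosine"
).fit(train_embeddings)

def build_prompt(charge: str) -> str:
    """Retrieve the most similar labeled charges and format a few-shot prompt."""
    query_emb = embedder.encode([charge], normalize_embeddings=True)
    _, idxs = index.kneighbors(query_emb)
    examples = "\n".join(
        f"Charge: {train_texts[i]} -> Category: {train_examples[i][1]}"
        for i in idxs[0]
    )
    return (
        "Classify the charge into one of the 230 categories.\n"
        f"Examples:\n{examples}\n"
        f"Charge: {charge} -> Category:"
    )
```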

For my next experiment, I fine-tuned the much smaller open-source model Llama-2-7b, and we saw significant improvements across all metrics compared to GPT-4. The fine-tuned model achieved 97% accuracy on the easier dataset and 85% on the difficult dataset. This approach was not only highly efficient, with response times typically under half a second, but also far more cost-effective.

The encouraging results motivated me to explore other methods and I experimented with combining fine-tuned and expert models. Interestingly, this didn’t net any improvements in performance.

| Pattern | Model | Prompt Contents | 98% "easier" dataset | 2% "difficult" dataset | RTT (s) | Cost |
|---|---|---|---|---|---|---|
| Expert LLM | GPT-4 | Charge, instructions, 230 categories | 87.8% | 81.8% | 15 | ~$12k |
| Expert + RAG | Extend: GPT-4 + Training Set | Charge, instructions, 6 examples | 95.8% | 79.3% | 7 | ~$7k |
| Fine-Tuned LLM | Llama-2-7b | Charge | 97.2% | 85.0% | 0.5 | <$800 |
| Fine-Tuned + Expert | Llama-2-chat + GPT-4 | | No gain | No gain | 15 | |

Productionizing our Fine-tuned SLMs with Predibase 

During our experimentation, we tested several LLM fine-tuning and inference platforms. Ultimately, we chose Predibase as our platform of choice because it delivered the best performance and reliable support. These are some of the highlights of our work together:

Most accurate fine-tuning results

Our best-performing model was llama-3-8b-instruct, a small open-source LLM that we fine-tuned on Predibase. We achieved an accuracy of 90% for the most challenging 2% of cases, outperforming both GPT-4 and all of our other fine-tuning experiments. Additionally, Predibase consistently provided highly accurate results, whereas some of the other tools we tested were less reliable across fine-tuning runs.
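For readers who want a sense of what a fine-tuning run can look like, below is a minimal sketch using the Predibase Python SDK. It is a sketch under stated assumptions: the method names, config fields, dataset format, and identifiers are based on public SDK examples, not our exact production configuration, and may differ in the SDK version you use.

```python
# Minimal fine-tuning sketch with the Predibase SDK. Names, fields, and file
# formats are assumptions based on public examples -- check the current SDK docs.
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# Dataset of prompt/completion pairs, e.g. a JSONL file with
# {"prompt": "<charge text>", "completion": "<category>"} rows (hypothetical file).
dataset = pb.datasets.from_file("adjudication_train.jsonl", name="adjudication-train")

# Repository that will hold the adapter versions for this use case (hypothetical name).
repo = pb.repos.create(name="charge-classifier", exists_ok=True)

# Kick off a LoRA fine-tuning job on a small open-source base model.
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="llama-3-8b-instruct"),
    dataset=dataset,
    repo=repo,
    description="Classify complex charges into 230 categories",
)
```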

Lowest latency inference

Predibase is specifically optimized for low-latency inference on fine-tuned models and consistently delivers 0.15-second response times for our production traffic, roughly 30x faster than our GPT-4 experiments. This is critical for meeting the demands of our customers. Best of all, with open-source LoRAX under the hood, we can serve additional LoRA adapters on Predibase without needing more GPUs.
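At inference time, the pattern looks roughly like the sketch below: one shared deployment of the base model serves many LoRA adapters via LoRAX, so adding a use case means adding an adapter, not a GPU. The deployment name, adapter ID, and example prompt are placeholders, and the exact client calls may differ from the SDK version you use.

```python
# Illustrative inference call against a shared base-model deployment; the
# deployment and adapter names are placeholders, not our production identifiers.
from predibase import Predibase

pb = Predibase(api_token="<PREDIBASE_API_TOKEN>")

# One shared deployment of the base model serves many LoRA adapters via LoRAX.
client = pb.deployments.client("llama-3-8b-instruct")

response = client.generate(
    "POSS CNTL SUB 3RD DEG",           # the charge text (the only prompt content needed)
    adapter_id="charge-classifier/1",  # adapter repo/version to hot-load
    max_new_tokens=16,
)
print(response.generated_text)
```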

Significant cost savings

We reduced our inference costs by 5x compared to GPT-4 by fine-tuning and serving SLMs on Predibase. As mentioned above, Predibase also enables us to expand to multiple use cases using multi-LoRA serving on LoRAX, which will further reduce costs as we add use cases.

Easy-to-use platform

Beyond its robust and developer-friendly SDK, Predibase also offers a user-friendly web app. The web UI makes it easy to manage projects, track model versions, and explore model performance with visualizations like loss graphs. Recently, Predibase also added dashboards for production metrics, which help me understand production efficiency and performance over time. Now, when I deploy another solution on the same machine, I will be able to see the performance impact immediately.

Lessons learned fine-tuning dozens of models

Given the large size of my dataset—comprising 150,000 rows of training data—I faced a number of challenges during my fine-tuning experiments and learned a lot along the way. To help others on their fine-tuning journey, I captured my key findings below:

1. Monitor model convergence

One key takeaway from my experience was the importance of monitoring model convergence. If the model isn't converging as expected, it's a good idea to try experimenting without the auto-stopping parameter. This approach can help the model reach a global minimum instead of getting stuck at a local one. By doing so, you can ensure that the model is performing at its best and not missing out on potential improvements.

Checkr loss graph
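As a simple illustration of the kind of check I mean, the sketch below compares recent validation losses to decide whether a run still looks like it is converging. The window size, tolerance, and loss values are arbitrary; in practice the losses come from your training logs or the platform's loss graphs.

```python
# Toy convergence check: decide whether validation loss is still improving.
# Window, tolerance, and loss values are arbitrary illustrations.
def still_improving(val_losses: list[float], window: int = 5, tol: float = 1e-3) -> bool:
    """Return True if the best loss in the last `window` evals beat the best before it."""
    if len(val_losses) <= window:
        return True  # not enough history to judge yet
    recent_best = min(val_losses[-window:])
    previous_best = min(val_losses[:-window])
    return recent_best < previous_best - tol

losses = [2.31, 1.70, 1.22, 0.98, 0.91, 0.90, 0.89, 0.89, 0.89]
if still_improving(losses):
    print("Keep training: loss is still decreasing.")
else:
    print("Plateau: consider a run without auto-stopping, or a different learning "
          "rate, before concluding the model has reached its best minimum.")
```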

2. Fine-tuned models are not as sensitive to hyperparameters

Deep learning models can be sensitive to hyperparameters. However, I found that this sensitivity was less pronounced during fine-tuning, which was a positive discovery because it required fewer trials and less tuning.

3. Use short prompts to extend cost savings

With a production model in place, I experimented with prompt engineering to boost performance, but it had minimal impact. The silver lining is that you can significantly reduce token usage by keeping prompts concise, since the fine-tuned model's performance is largely insensitive to the prompt's wording and length.
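To put a rough number on the savings, here is a quick way to compare the token counts of an instruction-heavy prompt versus a minimal one using a Hugging Face tokenizer. The tokenizer and the example prompts are illustrative; any tokenizer you have access to works for a rough comparison.

```python
# Rough comparison of token counts for a verbose vs. a minimal prompt.
# The checkpoint below is gated; swap in any tokenizer you have access to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

verbose_prompt = (
    "You are an expert adjudicator. Classify the following criminal charge "
    "into exactly one of the 230 categories listed below, following all "
    "instructions carefully...\nCharge: POSS CNTL SUB 3RD DEG"
)
minimal_prompt = "POSS CNTL SUB 3RD DEG"

verbose_tokens = len(tokenizer.encode(verbose_prompt))
minimal_tokens = len(tokenizer.encode(minimal_prompt))
print(f"verbose: {verbose_tokens} tokens, minimal: {minimal_tokens} tokens")
# Since a fine-tuned model's accuracy barely depends on the extra instructions,
# dropping them cuts per-request input tokens (and cost) proportionally.
```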

4. High temperature and low top K to identify less confident predictions

When working with classifiers, it's helpful to identify subsets of data where the model's confidence is low. This can be done by computing a confidence score, often from the sum of the token log probabilities. However, I noticed that this method often produced a skewed distribution, with the model appearing overly confident and consistently returning high confidence scores.

To tackle this issue, I experimented with adjusting inference parameters. Raising the temperature increased next-token variance, while lowering top_k narrowed the options the model could choose from. This combination maintained precision but produced a broader distribution of confidence scores, which helped identify less confident predictions.
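Below is a sketch of the confidence calculation itself: summing the token log-probabilities of the generated label and converting the sum to a sequence-level probability. How you obtain the per-token logprobs depends on your serving stack, so the example values are made up for illustration.

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Convert per-token log-probabilities of the generated label into a 0-1 score."""
    return math.exp(sum(token_logprobs))

# Example logprobs for the tokens of a predicted category (made-up values).
confident_pred = [-0.01, -0.02, -0.05]
uncertain_pred = [-0.40, -0.90, -1.20]

print(sequence_confidence(confident_pred))   # ~0.92
print(sequence_confidence(uncertain_pred))   # ~0.08

# Predictions below some threshold can be routed to human review. With default
# sampling settings the score distribution was heavily skewed toward 1.0;
# raising temperature and lowering top_k at inference time spread it out enough
# to separate the genuinely uncertain cases.
```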

5. Predibase PEFT: Matching full fine-tuning efficiency at lower costs

In my exploration of fine-tuning methods, I compared Predibase's Parameter Efficient Fine-Tuning (PEFT) approach with traditional full fine-tuning. Predibase leverages techniques like Low-Rank Adaptation (LoRA), which reduces the number of trainable parameters needed for fine-tuning. With this approach, I was able to reduce training cost and time while delivering performance comparable to full fine-tuning.
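For intuition about why LoRA is so much cheaper, here is a small sketch using the open-source peft library (not Predibase's internal implementation) showing how few parameters actually get trained. The base model choice and LoRA settings are illustrative assumptions.

```python
# Illustrative LoRA setup with the open-source peft library (not Predibase's
# internals). Only the small low-rank adapter matrices are trained; the base
# model's weights stay frozen. The checkpoint is gated; any causal LM works here.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attach adapters to attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
# Prints something like: trainable params in the millions vs. ~8B total,
# i.e. well under 1% of the weights are updated.
```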

Conclusion

Using Predibase, we've managed to train highly accurate and efficient small models, achieve lightning-fast inference, and reduce costs by 5X compared to GPT-4. But most importantly, we've been able to build a better product for our customers, leading to more transparent and efficient hiring practices.
