Introducing the First End-to-End Platform for Reinforcement Fine-Tuning

March 19, 2025 · 4 min read
Devvret Rishi
Travis Addair

To learn more about RFT, sign up for our webinar on March 27th.

Today we're launching the first end-to-end platform for reinforcement fine-tuning (RFT). 

When we started Predibase four years ago, our goal was to make it easy for any developer to build and deploy fit-for-purpose models on their data. Since then, we've heard the feedback loud and clear: teams see amazing results fine-tuning models for their tasks, but access to high-quality labeled data is the most frequent blocker. That changes today.

OpenAI offered the first glimpse of a managed RFT solution during the 12 Days of OpenAI, but it has remained in research preview and covers only their proprietary models. Since then, DeepSeek-R1 captivated the world's attention by showing that reinforcement learning can elevate LLM performance in a data- and compute-efficient way, and by open-sourcing novel techniques like GRPO.

Now, we're releasing a fully managed, serverless, end-to-end platform for reinforcement fine-tuning.

Try it yourself today in our interactive RFT playground.

What is RFT & Why We’re Excited

Reinforcement fine-tuning allows an LLM to learn from reward functions that steer the model toward desired outcomes, rather than purely from labeled examples as in supervised fine-tuning (SFT). The technique works particularly well for reasoning tasks, where models like DeepSeek-R1 and OpenAI o1 excel, and for settings where labeled data is scarce but you can write rubrics to score performance.

[Figure: RFT flow chart]
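To make this concrete, here's a minimal rubric-style reward function of the kind RFT learns from. It is our own illustrative sketch, not part of the Predibase platform: the `<think>` tag format and the scoring weights are assumptions chosen for the example.

```python
import re

def reward(prompt: str, completion: str) -> float:
    """Rubric-style reward: no gold label required, only checkable properties."""
    score = 0.0
    # Partial credit when the model shows its reasoning in <think> tags
    # (an assumed output format for this example).
    if re.search(r"<think>.+?</think>", completion, flags=re.DOTALL):
        score += 0.5
    # Credit for ending with a well-formed numeric final answer.
    if re.search(r"Answer:\s*-?\d+(?:\.\d+)?\s*$", completion.strip()):
        score += 0.5
    return score  # GRPO compares scores across a group of sampled completions
```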

In our experience, RFT delivers exceptional results for tasks like code generation, where correctness can be objectively verified through execution, and complex RAG scenarios where factual accuracy and reasoning quality are paramount. In these areas, we've consistently seen RFT provide a meaningful performance lift over even the most capable base LLMs.
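For code generation specifically, the reward can come from actually executing the model's output. The sketch below is a simplified illustration: it assumes the generated code defines a hypothetical `solve(a, b)` function and that we hold a couple of unit tests; a production reward server would sandbox execution far more carefully.

```python
import subprocess
import sys
import tempfile

# Hypothetical unit tests for a generated `solve(a, b)` function.
TESTS = ["assert solve(2, 3) == 5", "assert solve(-1, 1) == 0"]

def execution_reward(generated_code: str) -> float:
    """Reward = fraction of tests the generated code passes when run."""
    passed = 0
    for test in TESTS:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n" + test)
        try:
            # Run in a separate process with a timeout; real systems isolate further.
            result = subprocess.run(
                [sys.executable, f.name], capture_output=True, timeout=5
            )
            passed += int(result.returncode == 0)
        except subprocess.TimeoutExpired:
            pass  # hung code earns no credit
    return passed / len(TESTS)

print(execution_reward("def solve(a, b):\n    return a + b"))  # -> 1.0
```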

What's particularly exciting is that RFT enables continuous improvement—as your reward functions evolve and as you gather more feedback data, your models can keep getting better at solving your specific challenges.

Introducing the Complete RFT Platform

We aimed for our platform to be the first to offer two key things: 

  1. Fully managed, serverless infrastructure
  2. An end-to-end experience that goes from data to high-performance serving on the Predibase Inference Engine

Fully Managed and Serverless

[Figure: RFT training architecture]

RFT leverages a number of components built natively into the Predibase platform to produce models that continuously learn. At every training step, we save a fine-tuned adapter reflecting the latest stage of training, and LoRAX, our multi-LoRA serving framework, can instantly load each checkpoint to generate the next set of completions. This lets the service constantly evaluate the latest version of the fine-tuned model with zero loading latency. We also overlap training and generation with streaming micro-batches, which keeps GPU utilization near 100% throughout training.
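As a rough mental model, the overlap behaves like a bounded producer-consumer pipeline: a generation thread streams completions from the newest adapter while a training thread consumes them and publishes the next checkpoint. The toy below uses stand-in objects of our own; it is not Predibase internals.

```python
import queue
import threading

micro_batches: queue.Queue = queue.Queue(maxsize=4)  # bounded stream of rollouts
latest_adapter = {"step": 0}  # stand-in for a hot-swappable LoRA checkpoint

def generate(prompt: str, step: int) -> str:
    # Stand-in for LoRAX serving the step-`step` adapter with no reload pause.
    return f"completion(prompt={prompt!r}, adapter_step={step})"

def generation_loop(prompts):
    for p in prompts:
        micro_batches.put(generate(p, latest_adapter["step"]))
    micro_batches.put(None)  # sentinel: generation finished

def training_loop():
    while micro_batches.get() is not None:
        # Stand-in for scoring completions with reward functions and taking a
        # GRPO step; publishing the new checkpoint makes it visible to LoRAX.
        latest_adapter["step"] += 1

producer = threading.Thread(target=generation_loop, args=(["p1", "p2", "p3"],))
consumer = threading.Thread(target=training_loop)
producer.start(); consumer.start()
producer.join(); consumer.join()
```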

We've also found that a brief supervised fine-tuning warm-up before starting GRPO improves performance, so we've natively integrated SFT into our RFT workflow. Finally, since every task is unique and often calls for custom reward functions, we've built a secure reward server that executes user code in isolated environments, in parallel.

End-to-End Platform

In addition to training models with RFT, it's crucial that our customers have a best-in-class serving solution for every model they create. The Predibase Inference Engine now natively supports all RFT models trained on the platform, and it's compatible with features like deployment monitoring and Turbo LoRA, which accelerates the throughput of reasoning models trained via reinforcement learning.

Starting today with Predibase, you can truly go from data to a deployed model: connect a dataset, train with an SFT warm-up, refine with RFT, and deploy a high-throughput serverless model into production backed by industry-grade SLAs.
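In pseudocode, that flow looks something like the sketch below. The client, method names, and parameters are illustrative placeholders, not the actual Predibase SDK surface.

```python
# Illustrative pseudocode only: RFTClient and its methods are hypothetical
# stand-ins for the real workflow, not actual Predibase APIs.
client = RFTClient(api_token="...")

dataset = client.connect_dataset("s3://my-bucket/tasks.jsonl")  # 1. bring data
job = client.finetune(
    base_model="qwen2.5-coder-32b-instruct",
    dataset=dataset,
    warmup="sft",                  # 2. brief SFT warm-up before RL
    method="grpo",                 # 3. refine with reinforcement fine-tuning
    reward_fns=["my_rewards.py"],  #    custom rewards run on the reward server
)
deployment = client.deploy(job.best_checkpoint)  # 4. serverless, SLA-backed serving
print(deployment.endpoint_url)
```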

Case Study: Beating o1 and DeepSeek-R1 by 3x at GPU Code Generation

To demonstrate the power of our RFT platform, we developed a specialized model for translating PyTorch code to Triton, a GPU kernel language that occupies a middle ground between PyTorch and raw CUDA. This is a task most LLMs struggle with: it requires a deep understanding of both frameworks and complex reasoning about computational efficiency.
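To give a flavor of the task (this toy pair is our own illustration, not output from the model): even a trivial elementwise add looks very different once expressed as a Triton kernel with explicit blocks, offsets, and masking.

```python
import torch
import triton
import triton.language as tl

def add_pytorch(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return x + y  # the PyTorch source: one line, no kernel details

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block am I?
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add_triton(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)                   # one program per block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```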

Our training process combined cold-start supervised fine-tuning with reinforcement learning (GRPO) and employed curriculum learning to tackle progressively harder tasks. We detailed some of these techniques in an earlier blog post.
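Curriculum learning here means ordering the task pool by difficulty and widening it as training proceeds. A generic sketch (our own illustration, not the exact Predibase recipe) looks like this:

```python
def curriculum(tasks, difficulty, total_stages=3):
    """Yield progressively larger task pools, easiest tasks first."""
    ordered = sorted(tasks, key=difficulty)
    for stage in range(1, total_stages + 1):
        # Stage 1 sees the easiest slice; the final stage sees everything.
        cutoff = len(ordered) * stage // total_stages
        yield ordered[:cutoff]

# Toy example: three stages over tasks scored by prompt length.
for pool in curriculum(["a", "bbbb", "cc"], difficulty=len):
    print(pool)  # ['a'], then ['a', 'cc'], then ['a', 'cc', 'bbbb']
```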

We ran our benchmarks on KernelBench, a dataset of 250 diverse tasks designed to assess an LLM's ability to transpile code into valid, efficient kernels, and our model delivered remarkable results.

[Figure: PyTorch-to-Triton benchmark results]

In our implementation, we fine-tuned a relatively small model that fits on a single GPU, Qwen2.5-Coder-32B-Instruct, and benchmarked kernel correctness against much larger foundation models, including DeepSeek-R1, Claude 3.7 Sonnet, and OpenAI o1.

What makes these results particularly impressive is that our model achieved 3x higher correctness than OpenAI o1 and DeepSeek-R1, and more than 4x the performance of Claude 3.7 Sonnet, despite having an order-of-magnitude smaller footprint.

We’re excited to open source our model today on HuggingFace, alongside the launch of RFT in the platform.

The future of LLM customization is here

Reinforcement fine-tuning represents a significant leap forward in the evolution of LLMs, enabling models to learn from reward functions rather than just labeled examples. With Predibase's end-to-end RFT platform, we're democratizing this powerful technique and making it accessible to developers and enterprises alike.

Our platform eliminates the infrastructure complexity and technical barriers so you can just focus on building your next breakthrough. By combining fully-managed, serverless infrastructure with an integrated workflow from data to deployment, we're enabling organizations to build high-performing custom models even with limited labeled data. 

The results speak for themselves: our PyTorch-to-Triton model demonstrates that RFT can produce specialized models that outperform much larger foundation models on specific tasks, while requiring significantly fewer resources.

Today, we’re releasing both the v1 of the platform as well as a sneak peek at some of the features we’re actively developing like natural language reward functions. We invite you to experience the power of reinforcement fine-tuning for yourself or request a demo to see the latest application of the technology. Sign up today or join our upcoming webinar on March 27th where we'll demonstrate how to build custom models using our RFT platform.

The future of model customization is here—and it's powered by reinforcement fine-tuning on Predibase.
