RFT in Action: The Future of AI Training

See how reinforcement fine-tuning and reward functions guide model training to maximize performance
Translating PyTorch functions into efficient Triton GPU kernels is a complex but critical task for maximizing AI performance. While PyTorch makes it easy to write high-level code, it often struggles to fully leverage GPU hardware. Triton, on the other hand, enables highly optimized GPU execution—but writing Triton code requires deep expertise in GPU programming. Bridging this gap means automating the conversion process, turning high-level PyTorch functions into fast, efficient GPU kernels without manual effort. It’s a challenge that, if solved, can dramatically accelerate AI workloads and reduce development time.

For this particular model, we fine-tuned Qwen/Qwen2.5-Coder-32B-Instruct.

Epoch

110100

Observations and Actions from the Predibase Team:

The model shows partial progress—generating code that may execute but is incorrect or inefficient. It also demonstrates tendencies toward reward hacking (like bypassing kernel work for easier wins) or making architectural mistakes that compromise performance or correctness.

Prompt

Average Reward Score at Epoch 10
PyTorch FunctionformattingcompilationcorrectnessspeedTotal
LeakyReLU0.9100.8900.7600.0002.560
Matmul for Lower Triangular Matrices0.9200.8800.9300.0002.730
Cosine Similarity Loss0.7800.9200.0000.0001.700
Viewing prompt - LeakyReLU
<|im_start|>system
You are a helpful assistant that converts PyTorch code into Triton kernels.<|im_end|>
<|im_start|>user
Convert this PyTorch module implementationinto an equivalent Triton kernel:

<torch_code>
import torch
import torch.nn as nn

class Model(nn.Module):
    """
    Simple model that performs a LeakyReLU activation.
    """
    def __init__(self, negative_slope: float = 0.01):
        """
        Initializes the LeakyReLU module.

        Args:
            negative_slope (float, optional): The negative slope of the activation function. Defaults to 0.01.
        """
        super(Model, self).__init__()
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Applies LeakyReLU activation to the input tensor.

        Args:
            x (torch.Tensor): Input tensor of any shape.

        Returns:
            torch.Tensor: Output tensor with LeakyReLU applied, same shape as input.
        """
        return torch.nn.functional.leaky_relu(x, negative_slope=self.negative_slope)

batch_size = 16
dim = 16384

def get_inputs():
    x = torch.randn(batch_size, dim)
    return [x]

def get_init_inputs():
    return []  # No special initialization inputs needed
</torch_code>
The Triton kernel should:
1. Import torch, triton, and triton.language as tl and other necessary modules
2. Use @triton.jit decorator on the kernel implementation (not the entrypoint function)
3. Have proper grid and block sizes
4. Use a mask in the load/store operations
5. Use typed constants (tl.constexpr)
6. Handle tensor dimensions correctly
7. Return output matching PyTorch's implementation
8. Do not include any test code in your response, only the Triton kernel implementation and entrypoint function
The triton.language (tl) module supports the following methods: PropagateNan, TRITON_MAX_TENSOR_NUMEL, abs, advance, arange, argmax, argmin, associative_scan, atomic_add, atomic_and, atomic_cas, atomic_max, atomic_
min, atomic_or, atomic_xchg, atomic_xor, bfloat16, block_type, broadcast, broadcast_to, cast, cat, cdiv, ceil, clamp, const, const_pointer_type, constexpr, cos, cumprod, cumsum, debug_barrier, device_assert, device
*print, div*rn, dot, dtype, erf, exp, exp2, expand_dims, fdiv, flip, float16, float32, float64, float8e4b15, float8e4b8, float8e4nv, float8e5, float8e5b16, floor, fma, full, function_type, histogram, inline_asm_ele
mentwise, int1, int16, int32, int64, int8, interleave, join, load, log, log2, make_block_ptr, max, max_constancy, max_contiguous, maximum, min, minimum, multiple_of, num_programs, pair_uniform_to_normal, permute, p
hilox, pi32_t, pointer_type, program_id, rand, rand4x, randint, randint4x, randn, randn4x, range, ravel, reduce, reshape, rsqrt, sigmoid, sin, softmax, sort, split, sqrt, sqrt_rn, static_assert, static_print, stati
c_range, store, str_to_ty, sum, swizzle2d, tensor, trans, uint16, uint32, uint64, uint8, uint_to_uniform_float, umulhi, view, void, where, xor_sum, zeros, zeros_like
The entrypoint function must be named: fn_20_LeakyReLU_triton
The Triton kernel implementation (called by the entrypoint) must be named: fn_20_LeakyReLU_kernel
No computation logic should be done within the entrypoint function. All computation logic should be done within the Triton kernel implementation.
The final generated code in the response must start with <triton_code> and end with </triton_code> tags.<|im_end|>
<|im_start|>assistant

Reward Graphs

Completion and Reward Scores

Completion at Epoch 10 - LeakyReLU

total: 3.000formatting: 1.000compilation: 1.000correctness: 1.000speed: 0.000

Observation

By epoch 10, the model is able to generate a correct kernel, but it is slower than the original PyTorch implementation.

import torch
import triton
import triton.language as tl

@triton.jit
def fn_20_LeakyReLU_kernel(
    x_ptr,  # *Pointer to input tensor
    y_ptr,  # *Pointer to output tensor
    N,      # Total number of elements in the input tensor
    negative_slope,  # Negative slope for the LeakyReLU
    BLOCK_SIZE: tl.constexpr
):
    # Calculate the index of the current thread
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < N
    # Load input from global memory
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    # Apply LeakyReLU activation
    y = tl.where(x >= 0, x, x * negative_slope)
    # Write output to global memory
    tl.store(y_ptr + offsets, y, mask=mask)

def fn_20_LeakyReLU_triton(x, negative_slope=0.01):
    N = x.numel()
    y = torch.empty_like(x)
    # Define the grid and block size
    BLOCK_SIZE = 1024
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK_SIZE']),)
    # Launch the kernel
    fn_20_LeakyReLU_kernel[grid](x, y, N, negative_slope, BLOCK_SIZE=BLOCK_SIZE)
    return y

Reward Function -

Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.
Most powerful way to train. Fastest way to serve. Smartest way to scale.