Predibase Inference Endpoints

Predibase supports state-of-the-art, efficient inference for both pre-trained and fine-tuned models at the same flat per-token price. This is enabled by LoRA Exchange (LoRAX), Predibase’s technology for serving fine-tuned models, which allows us to offer the most cost-effective fine-tuned model serving on the market.

For comparison, OpenAI charges 8x more for inference on fine-tuned GPT-3.5 models than on the base model. Most other OSS LLM infrastructure companies don’t offer per-token pricing for fine-tuned models at all, forcing you onto an expensive $ / GPU-hour pricing model.

Supported base models include:

Code-llama-13b-instruct (new!)
Code-llama-34b (coming soon)
Mistral-7b-instruct (new!)
Zephyr-7b-beta (new!)

Fine-tuned Model Size | Price per 1k tokens (input + output) | Together     | HuggingFace
Up to 7B              | $0.0002                              | $ / GPU-hour | $ / GPU-hour
Up to 13B             | $0.00025                             | $ / GPU-hour | $ / GPU-hour
Up to 70B             | $0.001                               | $ / GPU-hour | $ / GPU-hour
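
To make the per-token pricing concrete, here is a minimal sketch that estimates an inference bill from the table above. The function names and the example workload are illustrative assumptions, not part of any Predibase API, and the sketch assumes a model is billed at the smallest listed tier that covers its nominal parameter count.

```python
from decimal import Decimal

# Prices per 1k tokens (input + output), copied from the pricing table.
# Each entry is (maximum model size in billions of parameters, USD per 1k tokens).
PRICE_TIERS = [
    (7, Decimal("0.0002")),
    (13, Decimal("0.00025")),
    (70, Decimal("0.001")),
]


def price_per_1k_tokens(model_size_b):
    """Return the per-1k-token price of the smallest tier that fits the model.

    Assumption: tiers are matched on the nominal size, e.g. 7 for a "7b" model.
    """
    for max_size_b, price in PRICE_TIERS:
        if model_size_b <= max_size_b:
            return price
    raise ValueError(f"no listed tier covers a {model_size_b}B model")


def inference_cost(model_size_b, input_tokens, output_tokens):
    """Estimate the cost in USD; input and output tokens share one rate."""
    total_tokens = input_tokens + output_tokens
    return price_per_1k_tokens(model_size_b) * total_tokens / 1000


if __name__ == "__main__":
    # Hypothetical workload: 1M input and 250k output tokens on a 13B model.
    print(inference_cost(13, 1_000_000, 250_000))
```

Because the price is the same for base and fine-tuned models, the estimate depends only on model size and total token count.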

Dedicated Deployments

If your use case or organization requires it, you can also spin up a private, dedicated deployment for your fine-tuned model, billed on a $ / GPU-hour basis. Like Predibase's shared deployments, dedicated deployments are built on LoRA Exchange (LoRAX), so you can serve multiple fine-tuned models from a single dedicated deployment at no additional cost.

Predibase Training Costs

Predibase offers state-of-the-art fine-tuning and charges for training based on the underlying compute costs. Expected costs vary with the dataset, the model being fine-tuned, the compute resources allocated, and the overall fine-tuning time. You will be billed on a $ / GPU-hour basis for training jobs.
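
As a rough illustration of $ / GPU-hour billing, the sketch below multiplies an hourly rate by the number of GPUs and the job duration. This page does not list a GPU-hour rate, so the rate in the example is a placeholder assumption, and the function name is hypothetical.

```python
def training_cost(gpu_hourly_rate_usd, num_gpus, hours):
    """Estimate a training bill as rate x GPUs x wall-clock hours.

    All three inputs are assumptions supplied by the caller; the actual
    rate depends on the compute resources allocated to the job.
    """
    gpu_hours = num_gpus * hours
    return gpu_hourly_rate_usd * gpu_hours


if __name__ == "__main__":
    # Placeholder values only: 4 GPUs for 1.5 hours at a made-up $2.00 / GPU-hour.
    print(training_cost(2.0, 4, 1.5))
```

A larger dataset or model generally increases the hours term, which is why expected costs vary from job to job.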

