Breaking Down Apple’s Reference Architecture for GenAI: Small, Fine-tuned and Built on LoRA

June 13, 2024 · 3 min read
Devvret Rishi

This week, Apple announced how their on-device AI models power the newest set of Apple Intelligence capabilities. We think this is the most important AI announcement of the year so far. Here's why:

The New Reference Architecture for Production AI: One Small Model, Many Fine-Tuned LoRA Adapters

In their announcement, Apple introduced a reference architecture for AI systems that cracked two major problems with one elegant design: 

  1. How to build AI models that can be customized for several different tasks, like proofreading emails or scheduling calendar events on a phone.
  2. How to pack these different customizations into a single AI model small enough to fit on-device, so it preserves privacy and feels snappy.
Apple's GenAI approach includes serving many small, task-specific LoRA adapters on a single LLM

Apple started with a relatively small language model (3B parameters) and then used a parameter-efficient fine-tuning technique called LoRA (Low-Rank Adaptation) to create multiple adapters, each tailored to one of the individual tasks the model needs to handle. The novel insight is that each adapter is tiny, usually 1% or less of the size of the original LLM's weights, yet contains all the knowledge needed to “adapt” the base model to do well on its task. LoRA fine-tuning is the technique we’ve used in our prior research at Predibase, where we’ve shown that LoRA adapters can consistently beat larger models like GPT-4 on domain-specific tasks (learn more about LoRA adapters).
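
To see why adapters are so small, here's a quick back-of-the-envelope calculation. The hidden size, layer count, and rank below are illustrative assumptions, not Apple's published configuration:

```python
# LoRA replaces a full-rank weight update dW (d_out x d_in) with a
# low-rank product B @ A, where B is (d_out x r) and A is (r x d_in).
# Trainable parameters added per adapted layer: r * (d_in + d_out).

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer."""
    return rank * (d_in + d_out)

# Illustrative numbers (NOT Apple's actual architecture): a 3B-parameter
# transformer with hidden size 3072 and 32 layers, adapting the four
# attention projections (q, k, v, o) in each layer at rank 16.
hidden, layers, rank = 3072, 32, 16
adapter_size = layers * 4 * lora_params(hidden, hidden, rank)

base_size = 3_000_000_000
print(f"Adapter params: {adapter_size:,}")                        # ~12.6M
print(f"Fraction of base model: {adapter_size / base_size:.2%}")  # ~0.42%
```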

Once Apple creates all of these fine-tuned task-specific adapters, they then have a dynamic serving system where the adapters can be hot-swapped in and out on top of the same base Apple AI model. This approach enables a single small language model (SLM) to handle many tasks with GPT-4-like performance, at a fraction of the size and at far lower training cost than training an LLM from scratch.
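
Conceptually, the serving pattern looks something like the sketch below. This is our own toy illustration of per-request adapter swapping (the class and task names are made up), not Apple's actual implementation:

```python
import numpy as np

class AdapterRegistry:
    """Toy sketch of per-request adapter hot-swapping on one frozen layer."""

    def __init__(self, base_weight: np.ndarray):
        self.base_weight = base_weight  # frozen base model weight (d_out x d_in)
        self.adapters = {}              # task name -> low-rank pair (A, B)

    def register(self, task: str, A: np.ndarray, B: np.ndarray):
        # Each adapter is just the low-rank pair for this layer:
        # A is (rank x d_in), B is (d_out x rank).
        self.adapters[task] = (A, B)

    def forward(self, task: str, x: np.ndarray) -> np.ndarray:
        # The base computation is shared by every task...
        y = x @ self.base_weight.T
        # ...and the tiny task-specific correction B @ A is applied on top
        # (the alpha/rank scaling factor is omitted for brevity).
        A, B = self.adapters[task]
        return y + x @ A.T @ B.T

d_model, rank = 64, 4
registry = AdapterRegistry(np.random.randn(d_model, d_model))
registry.register("proofread", np.random.randn(rank, d_model),
                               np.random.randn(d_model, rank))
registry.register("summarize", np.random.randn(rank, d_model),
                               np.random.randn(d_model, rank))

x = np.random.randn(1, d_model)
out = registry.forward("proofread", x)  # same base weights, different adapter
```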

How to build LoRA-powered SLMs like Apple using LoRAX

If Apple’s approach of dynamically hot-swapping fine-tuned adapters on top of a single base model sounds familiar, it’s because this is exactly what Predibase introduced for open-source LLMs with our popular multi-LoRA serving project LoRAX (LoRA Exchange) at the end of last year. LoRAX is an open-source framework that enables you to serve 100s of adapters on top of a single base model and GPU.
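
For example, querying a running LoRAX deployment with different adapters looks roughly like this with the lorax-client Python package; the server URL and adapter IDs below are placeholders for your own deployment:

```python
# pip install lorax-client
from lorax import Client

client = Client("http://127.0.0.1:8080")  # a running LoRAX server

# Same deployment, different LoRA adapters: one base model on one GPU.
# The adapter IDs are placeholders for your own fine-tuned adapters.
proofread = client.generate(
    "Rewrite this email to sound more professional: ...",
    adapter_id="my-org/proofreading-adapter",
    max_new_tokens=128,
)
schedule = client.generate(
    "Extract the meeting time from this message: ...",
    adapter_id="my-org/calendar-adapter",
    max_new_tokens=64,
)
print(proofread.generated_text)
print(schedule.generated_text)
```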

Serving 100s of LoRA adapters on a single GPU with LoRAX

Since its launch, LoRAX has powered multi-adapter serving in our cloud and within AWS SageMaker, and has been adopted by customers and community members (including Tinder) who want to host their own efficient deployments with multiple fine-tuned adapters. We’re inspired to see Apple bring this approach on-device, and we look forward to helping support the next wave of customers doing the same.

We believe this approach of many task-specific fine-tuned adapters on top of a single base model is going to define the way companies deploy massive AI systems going forward. We’re particularly excited to see it adopted by customers building agentic applications, which need many specialized models chained together into one end-to-end workflow (we sketch this chaining pattern after the quote below). Or, as Yann LeCun put it when resharing one of our posts:

“Lots of small, fine-tuned, specialized AI assistants” (Twitter)
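
Here's the chaining sketch mentioned above: a hypothetical two-step support workflow where each step hits the same LoRAX deployment with a different adapter (the adapter IDs are again placeholders):

```python
from lorax import Client

client = Client("http://127.0.0.1:8080")

def run(adapter_id: str, prompt: str) -> str:
    # One base model deployment serves every step; only the adapter changes.
    return client.generate(
        prompt, adapter_id=adapter_id, max_new_tokens=256
    ).generated_text

# Hypothetical workflow: classify the ticket, then draft a reply.
ticket = "My order #1234 arrived damaged, and I'd like a replacement."
category = run("my-org/ticket-classifier",
               f"Classify this support ticket: {ticket}")
reply = run("my-org/reply-drafter",
            f"Write a reply to a '{category}' ticket: {ticket}")
print(reply)
```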

At Predibase, we’ve been developing a platform with this core thesis in mind, and we even open-sourced LoRAX, the key piece of the framework that lets you deploy many small models multiplexed together. The good news is that you can now build your own SLMs like Apple without needing 1000s of engineers: fine-tuning LoRA adapters on Predibase is affordable (~$8), easy (2 lines of code or 2 clicks), and achieves large-model quality. Check out our sample notebook for an end-to-end example of how to train and serve multiple LoRA adapters for customer support automation.
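
For reference, kicking off a fine-tuning job looks roughly like the snippet below. This is a sketch based on the Predibase Python SDK; the dataset file, repo name, and base model are placeholder assumptions, not a verbatim recipe:

```python
# pip install predibase
from predibase import Predibase, FinetuningConfig

pb = Predibase(api_token="<YOUR_API_TOKEN>")

# Upload a dataset and create a repo to hold adapter versions
# (the file and repo names here are hypothetical).
dataset = pb.datasets.from_file("support_tickets.csv", name="support-tickets")
repo = pb.repos.create(name="ticket-classifier", exists_ok=True)

# The core two lines: configure and launch a LoRA fine-tuning job.
adapter = pb.adapters.create(
    config=FinetuningConfig(base_model="mistral-7b-instruct-v0-2"),
    dataset=dataset,
    repo=repo,
)
```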

As a leader in this space, with our research shared by leading experts, we’ll be in the driver’s seat as organizations start to develop this next generation of fine-tuned applications. If you’d like to start building an AI system the Apple way today, you can get started for free in our no-code or low-code environment.
