The advent of ChatGPT, and of Generative AI in general, is a watershed moment in the history of technology, often likened to the dawn of the Internet and the smartphone. Generative AI has shown remarkable potential in its ability to hold intelligent conversations, pass exams, generate complex code, and create eye-catching images and video. While GPUs run most Gen AI models in the cloud today – for both training and inference – this is not a scalable long-term solution, especially for inference, owing to factors that include cost, power, latency, privacy, and security. This article addresses each of these factors, along with motivating examples, to make the case for moving Gen AI compute workloads to the edge.
Most applications run on high-performance processors – either on device (e.g., smartphones, desktops, laptops) or in data centers. As the share of applications that use AI grows, CPU-only processors become inadequate for these workloads. Furthermore, the rapid expansion of Generative AI workloads is driving exponential demand for AI-enabled servers with expensive, power-hungry GPUs, which in turn drives up infrastructure costs. These AI-enabled servers can cost upwards of 7x the price of a regular server, and GPUs account for 80% of that added cost.
Additionally, while a conventional cloud server consumes 500W to 2,000W, an AI-enabled server consumes 2,000W to 8,000W – roughly four times as much. To support these servers, data centers need additional cooling modules and infrastructure upgrades, the cost of which can exceed the compute investment itself. Data centers already consume about 300 TWh per year, almost 1% of total worldwide power consumption. If current AI adoption trends continue, data centers could account for as much as 5% of worldwide power by 2030. There is also unprecedented investment in Generative AI data centers: capital expenditure on data centers is estimated to reach $500 billion by 2027, fueled mainly by AI infrastructure requirements.
AI compute cost and energy consumption will impede mass adoption of Generative AI. These scaling challenges can be overcome by moving AI compute to the edge and using processing solutions optimized for AI workloads. This approach brings other benefits to the customer as well, including lower latency, stronger privacy, higher reliability, and increased capability.
Compute follows data to the Edge
When AI emerged from academia roughly a decade ago, training and inference of AI models took place in the cloud or data center. With much of the data being generated and consumed at the edge – especially video – it made sense to move inference to the edge as well, improving the total cost of ownership (TCO) for enterprises through reduced network and compute costs. While AI inference costs in the cloud recur indefinitely, the cost of inference at the edge is a one-time hardware expense; augmenting a system with an edge AI processor therefore lowers overall operational costs. Just as conventional AI workloads have migrated to the edge (e.g., appliances, devices), Generative AI workloads will follow suit. This will bring significant savings to enterprises and consumers.
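The recurring-versus-one-time cost trade-off can be sketched with a simple break-even calculation. The figures below are illustrative assumptions, not vendor pricing, and the function name is hypothetical:

```python
# Hypothetical TCO sketch: recurring cloud inference fees vs. a one-time
# edge AI accelerator purchase. All dollar figures are assumptions chosen
# only to illustrate the break-even logic.

def months_to_break_even(edge_hw_cost: float, cloud_cost_per_month: float) -> float:
    """Months after which the one-time edge hardware outlay becomes cheaper
    than continuing to pay recurring cloud inference fees."""
    return edge_hw_cost / cloud_cost_per_month

# Assumed numbers for a mid-sized deployment:
EDGE_ACCELERATOR_COST = 12_000.0   # one-time hardware expense (USD, assumed)
CLOUD_INFERENCE_MONTHLY = 1_500.0  # recurring cloud inference bill (USD, assumed)

breakeven = months_to_break_even(EDGE_ACCELERATOR_COST, CLOUD_INFERENCE_MONTHLY)
print(f"Edge hardware pays for itself after {breakeven:.0f} months")
```

Under these assumed numbers the edge hardware pays for itself in well under a year; every month thereafter is pure savings relative to the cloud bill.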
The move to the edge, coupled with an efficient AI accelerator to perform inference, delivers other benefits as well. Foremost among them is latency. For example, in gaming applications, non-player characters (NPCs) can be controlled and augmented using generative AI. Using LLMs running on edge AI accelerators in a gaming console or PC, gamers can give these characters specific goals so that they can meaningfully participate in the story. The low latency of local edge inference allows NPC speech and motion to respond to players' commands and actions in real time, delivering a highly immersive gaming experience in a cost-effective and power-efficient manner.
In applications such as healthcare (e.g., patient evaluation, drug recommendations), privacy and reliability are paramount. Data and the associated Gen AI models must stay on-premises to protect patient data, and any network outage that blocks access to cloud-hosted AI models can be catastrophic. An edge AI appliance running a Gen AI model purpose-built for each enterprise customer – in this case a healthcare provider – solves both the privacy and reliability issues while also delivering lower latency and cost.
Many Gen AI models running in the cloud approach a trillion parameters, which makes them effective at general-purpose queries. Enterprise-specific applications, however, require models that deliver results pertinent to the use case. Take a Gen AI assistant built to take orders at a fast-food restaurant: for a seamless customer interaction, the underlying model must be trained on the restaurant's menu items, including their ingredients and allergens. The model size can be optimized by using a superset Large Language Model (LLM) to train a relatively small, 10-30 billion parameter LLM, which is then fine-tuned further on customer-specific data. Such a model can deliver results with increased accuracy and capability, and given its smaller size, it can be deployed effectively on an AI accelerator at the edge.
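Why a 10-30 billion parameter model fits on an edge accelerator comes down to weight-storage arithmetic: footprint scales with parameter count times bytes per weight. The sketch below estimates this for a hypothetical 13B-parameter model at common precisions (it ignores activation and KV-cache memory, which add overhead on top):

```python
# Back-of-envelope weight-memory footprint for an LLM at various precisions.
# Activation memory and KV cache are deliberately ignored; this covers
# weight storage only.

def model_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight-storage footprint in GB."""
    bytes_per_weight = bits_per_weight / 8
    return params_billion * 1e9 * bytes_per_weight / 1e9

for bits in (16, 8, 4):
    print(f"13B model @ {bits}-bit weights: {model_memory_gb(13, bits):.1f} GB")
```

At 16-bit precision a 13B model needs roughly 26 GB for weights alone; 8-bit quantization halves that, and 4-bit halves it again to about 6.5 GB – within reach of edge accelerator memory budgets, whereas a trillion-parameter model is not.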
Gen AI will win at the Edge
There will always be a need for Gen AI running in the cloud, especially for general-purpose applications like ChatGPT and Claude. But for enterprise-specific applications, such as Adobe Photoshop's Generative Fill or GitHub Copilot, Generative AI at the edge is not only the future – it's also the present. Purpose-built AI accelerators are the key to making this possible.