Avi Baum, CTO at Hailo – Interview Series

Avi Baum, CTO at Hailo, leads the company's technology vision and product innovation. He previously served as CTO for Wireless Connectivity at Texas Instruments, driving strategies for connected MCUs in IoT and IIoT markets, and held senior architecture and leadership roles in the Israel Defense Forces.
Hailo is an Israeli AI-chip company specializing in high-performance, low-power edge AI processors for applications such as autonomous vehicles, smart cameras, and robotics, supported by a comprehensive software suite and global partner ecosystem.
Can you share what originally drew you to the field of edge AI and how your early engineering experiences shaped your thinking about processor design?
My career path has taken me through emerging markets. During my tenure at TI (Texas Instruments), a semiconductor leader with a long-standing legacy, I had the opportunity to lead system-level design and architecture, heading the product definition department and later serving as its CTO. This led me to continuously explore the up-and-coming technologies that are likely to shape the "not-so-far" future.
When we founded Hailo in 2017, it was clear that AI, which had started to thrive in the cloud, also had the potential to become an enabling technology for edge devices. So, we set course and began this journey.
As generative AI expands at the edge, why is TOPS (tera operations per second) no longer a sufficient benchmark for evaluating processor performance?
TOPS has long been the go-to metric for evaluating AI hardware, but in the era of generative AI at the edge, it's no longer sufficient. Classic models translate large volumes of incoming data into meaningful insights, so the compute they need grows with the amount of data to be processed. These models are typically small relative to the data they handle, making the bandwidth overhead of accessing model parameters relatively negligible.
Generative models, however, are noticeably larger, in the billions-of-parameters domain, and in these cases memory bandwidth becomes a non-negligible factor.
Rather than focusing on TOPS alone, it's critical to assess how well a processor balances compute and memory under real-world conditions. It's not about chasing the highest number; it's about tuning the architecture to the workloads it needs to handle.
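To make that balance concrete, here is a minimal roofline-style sketch in Python (all numbers are illustrative assumptions, not Hailo specifications): a workload becomes memory-bound whenever its arithmetic intensity falls below the chip's compute-to-bandwidth ratio, which is why generative decoding can starve even a high-TOPS part.

```python
# Minimal roofline-style sketch (illustrative numbers only):
# a workload is memory-bound when its arithmetic intensity (ops per byte
# moved) falls below the processor's compute-to-bandwidth ratio.

def attainable_ops_per_s(peak_tops: float, bandwidth_gbs: float,
                         intensity_ops_per_byte: float) -> float:
    """Attainable throughput = min(peak compute, bandwidth * intensity)."""
    peak_ops = peak_tops * 1e12
    memory_bound_ops = bandwidth_gbs * 1e9 * intensity_ops_per_byte
    return min(peak_ops, memory_bound_ops)

# Hypothetical edge processor: 20 TOPS, 34 GB/s of LPDDR4X-class bandwidth.
# Break-even intensity = 20e12 / 34e9, roughly 590 ops/byte.
# LLM decode reads each weight roughly once per token (~2 ops/byte at INT8),
# so it sits deep in the memory-bound region regardless of TOPS.
for name, intensity in [("CNN perception", 1000.0), ("LLM decode", 2.0)]:
    ops = attainable_ops_per_s(20, 34, intensity)
    print(f"{name}: {ops / 1e12:.2f} effective TOPS")
```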
Why is memory bandwidth now becoming a more critical bottleneck than compute in edge AI workloads, especially for LLMs and VLMs?
For edge AI workloads, particularly those involving LLMs or VLMs, memory bandwidth is quickly becoming the primary bottleneck. These models typically range from 0.5 to 8 billion parameters, exceeding the capacity of on-chip memory and requiring access to off-chip memory like DRAM. This dramatically increases demand on memory bandwidth. For example, a 1B-parameter model can deliver up to ~40 tokens per second under optimal conditions with a standard LPDDR4X interface, but maintaining that rate with a 4B model requires over four times that bandwidth. Without it, performance suffers, not because of limited compute, but because the processor canât feed in data quickly enough. This imbalance between compute and memory is one of the most pressing challenges in deploying generative AI at the edge. This is further amplified in architectures that compute layer by layer, where intermediate results also increase memory traffic and further strain bandwidth.
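A back-of-envelope check of those figures (assuming INT8 weights at one byte per parameter, every weight read once per generated token, and roughly 40 GB/s of effective LPDDR4X-class bandwidth; real systems add KV-cache and activation traffic on top):

```python
# Decode throughput is capped by how fast the weights can be streamed in:
# tokens/s <= bandwidth / bytes read per token.

def max_tokens_per_s(params_billions: float, bandwidth_gbs: float) -> float:
    bytes_per_token = params_billions * 1e9  # 1 byte per INT8 weight
    return bandwidth_gbs * 1e9 / bytes_per_token

print(max_tokens_per_s(1.0, 40.0))  # ~40 tok/s for a 1B model
print(max_tokens_per_s(4.0, 40.0))  # ~10 tok/s for a 4B model at the same
                                    # bandwidth; holding ~40 tok/s would
                                    # need over four times the bandwidth
```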
How should product teams rethink their benchmarking strategy when designing for real-world edge applications?
Product teams should move away from relying on a single performance metric like TOPS and instead adopt a benchmarking strategy that reflects the realities of edge deployment. That starts by understanding the specific use case, the actual workload the processor needs to handle, and identifying the "working point": the intersection of power, cost, and latency constraints. From there, it's about evaluating how compute and memory interact under those conditions. A processor with high TOPS won't deliver if memory bandwidth is limited, and more memory won't help if compute capacity is insufficient.
Teams should assess whether the processor can sustain performance across perception, enhancement, and generative workloads, each with very different demands. The goal isn't to optimize for peak specs, but to ensure balanced performance across the full range of expected use cases in real-world environments.
This is a natural shift from "sterile" measures to more intricate approaches that reflect how platforms are actually used and rated, similar to what happened with other architectures as they became mainstream (e.g., SPEC, CoreMark, 3DMark).
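As a rough illustration of that shift, the sketch below scores a candidate processor against a representative workload mix rather than a single peak number; all figures are hypothetical placeholders, not measurements of any real part.

```python
# Use-case-driven benchmark sketch: check whether a candidate processor
# sustains the actual workload mix at the target working point, instead
# of quoting one peak-TOPS figure. All numbers are invented placeholders.

PROCESSOR = {"tops": 20.0, "bandwidth_gbs": 34.0}

WORKLOADS = [
    # (name, sustained TOPS needed, sustained GB/s needed)
    ("perception", 8.0, 5.0),
    ("enhancement", 4.0, 10.0),
    ("generative", 1.0, 40.0),
]

for name, need_tops, need_bw in WORKLOADS:
    compute_ok = need_tops <= PROCESSOR["tops"]
    memory_ok = need_bw <= PROCESSOR["bandwidth_gbs"]
    verdict = "sustains" if compute_ok and memory_ok else "bottlenecks"
    print(f"{name}: {verdict}")
# A 40-TOPS part with the same 34 GB/s would print the same verdicts:
# the generative workload is gated by bandwidth, not compute.
```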
How do power and cost constraints influence the architecture decisions behind Hailo processors, especially for consumer-facing edge devices?
Power and cost are two of the most defining constraints when designing AI processors for edge devices, especially in consumer-facing products. In compact devices like IoT sensors or smart home assistants, power budgets are tight, and there's often no active cooling, so energy efficiency becomes critical. Every additional compute or memory resource adds power draw and heat, which directly impacts usability and battery life.
Cost is equally influential. Consumer devices have to stay within competitive price points, meaning the processor can only carry so much compute and memory before it becomes economically unviable. These constraints force tough architectural trade-offs. At Hailo, we prioritize designs that deliver the right balance of compute and memory to meet real-world application needs within a tight envelope of power and cost, ensuring edge AI becomes viable, efficient, and scalable across a wide range of consumer products.
Could you walk us through how you define a "working point" for an application and why that matters so much in edge AI deployment?
Defining the "working point" is one of the most important steps when designing a system. It refers to the intersection of power, cost, and latency constraints that shape what's realistically achievable in a specific deployment. Unlike in the cloud, where you can throw more compute or memory at a problem, edge devices operate within a fixed envelope. That means you have to make deliberate trade-offs based on the application's actual requirements. For example, an IoT sensor might prioritize energy efficiency over raw performance, while an autonomous system might demand ultra-low latency regardless of power draw. Once the working point is established, you can evaluate whether the processor has the right balance of compute and memory to meet that need. It's not about maximizing specs in every direction; it's about ensuring sustained, reliable performance in the real-world conditions the application will face.
Generally speaking, the working point is where you want the key performance indicators to be at their optimum. Failing to define it might result in suboptimal operation under the platform's most typical usage scenarios.
As a simple example, one could make an AI analytics system extremely efficient when the input is at a very high resolution, but if this is deployed in systems that never reach this resolution, this optimization is meaningless.
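A minimal sketch of that idea, with invented numbers: intersect the power, cost, and latency constraints first, and a spec-sheet winner that misses the working point drops out immediately.

```python
# "Working point" filter sketch: a candidate only qualifies if it sits
# inside the intersection of power, cost, and latency constraints.
# All figures are hypothetical, chosen purely for illustration.

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    power_w: float
    cost_usd: float
    latency_ms: float

WORKING_POINT = {"max_power_w": 3.0, "max_cost_usd": 25.0, "max_latency_ms": 50.0}

candidates = [
    Candidate("high-TOPS part", power_w=8.0, cost_usd=60.0, latency_ms=10.0),
    Candidate("balanced part", power_w=2.5, cost_usd=20.0, latency_ms=35.0),
]

feasible = [c for c in candidates
            if c.power_w <= WORKING_POINT["max_power_w"]
            and c.cost_usd <= WORKING_POINT["max_cost_usd"]
            and c.latency_ms <= WORKING_POINT["max_latency_ms"]]
print([c.name for c in feasible])  # only the balanced part survives
```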
With video, audio, and language often blended in modern devices, how do you approach optimization across multimodal models?
Multimodal models require a thoughtful balance of compute and memory resources. Each modality stresses the system differently: video is compute-intensive due to high resolution and frame rates, while language and audio are more compact but place heavier demands on memory bandwidth. In applications like vision-language processing, this split becomes clear (a typical scenario, though not a guarantee): video processing pushes compute, while the language model can quickly hit memory bottlenecks.
We approach optimization by looking at how these workloads interact across the pipeline and ensuring the processor is architected to support them simultaneously, without letting one modality compromise the performance of another.
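One rough way to picture the split (the typical pattern described above, not a guarantee; all figures below are assumptions) is to budget the vision path in compute per frame and the language path in bytes moved per token:

```python
# Per-modality budget sketch for a vision-language pipeline:
# the vision encoder is dominated by compute per frame, the language
# decoder by weight bytes streamed per token. Numbers are illustrative.

def vision_tops_needed(gflops_per_frame: float, fps: float) -> float:
    return gflops_per_frame * fps / 1e3  # TOPS, assuming 1 op per FLOP

def llm_bandwidth_needed(params_billions: float, tokens_per_s: float) -> float:
    return params_billions * tokens_per_s  # GB/s at 1 byte per INT8 weight

print(vision_tops_needed(gflops_per_frame=50.0, fps=30))           # ~1.5 TOPS
print(llm_bandwidth_needed(params_billions=2.0, tokens_per_s=10))  # ~20 GB/s
# Both demands land on the chip at once, so neither budget can be sized
# in isolation without one modality starving the other.
```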
How does increasing model size at the edge complicate latency and power consumption, and what role does system-level architecture play in solving that?
As model size increases at the edge, latency and power consumption become harder to manage. Larger models rely more heavily on off-chip memory, which increases both energy use and delay, especially when memory bandwidth becomes a bottleneck. For instance, scaling from a 1B- to a 4B-parameter model would require over four times the bandwidth to maintain the same performance, but in practice, performance doesn't scale linearly due to bandwidth and system-level constraints.
Itâs not just about having high TOPS or large memory; itâs about how those components interact. A balanced design ensures compute, memory, and bandwidth work together efficiently, preventing one resource from limiting the whole system.
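A sketch of that non-linear effect, with an invented activation-overhead factor standing in for the extra intermediate traffic that larger, layer-by-layer computed models push off-chip:

```python
# Why throughput falls faster than model size grows: with bandwidth
# fixed, decode rate is capped by bytes moved per token, and larger
# models also spill more intermediate results to off-chip memory.
# The overhead factors are made-up illustrations, not measured values.

def tokens_per_s(params_b: float, bandwidth_gbs: float,
                 activation_overhead: float) -> float:
    bytes_per_token = params_b * 1e9 * (1.0 + activation_overhead)
    return bandwidth_gbs * 1e9 / bytes_per_token

print(tokens_per_s(1.0, 40.0, activation_overhead=0.05))  # ~38 tok/s
print(tokens_per_s(4.0, 40.0, activation_overhead=0.20))  # ~8 tok/s: a 4x
# larger model loses ~5x throughput once extra memory traffic is counted
```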
How does Hailo design for future-proofing, given how rapidly AI models, workloads, and deployment requirements are evolving?
Future-proofing in edge AI means designing processors that can handle a wide range of evolving workloads. At Hailo, we focus on balanced architectures that aren't tailored to just one task but can support everything from perceptive functions like object detection to generative models like VLMs. Each type of workload stresses compute and memory differently, so we design for flexibility, avoiding bottlenecks when switching between them. We also account for the real-world limits of power, cost, and latency across applications. By prioritizing workload diversity and resource balance, we aim to support the next generation of edge AI deployments across consumer and industrial use cases.
Yet one size can't fit all: the portfolio targets specific addressable applications and works within the available budget of, for example, power and form factor, and that is what defines a "working point".
What role does the developer ecosystem play in maximizing the value of a processor, and how are you ensuring teams can make full use of Hailo's capabilities?
Because the processor is a programmable device, it's essential to give developers easy tools to exercise its potential, shorten the path to deployment, and enable new use cases. By providing a well-supported environment around our processors, we help teams bring AI applications to life across a range of use cases.
What advice would you give to engineers or CTOs choosing their first AI accelerator for a next-gen product being built today?
With conditions now ripe, I believe there is a lot of innovation potential, allowing us to translate imagination into real products. In a rapidly changing environment, picking an accelerator that enables a rapid concept-to-deployment cycle is critical.
Thank you for the great interview. Readers who wish to learn more should visit Hailo.