Alexey Aylarov, Co-Founder and CEO of Voximplant – Interview Series

Alexey Aylarov co-founded Voximplant after a decade spent building communication tools from the ground up. His early work included IP PBX development and running his own telecom software company long before cloud telephony became mainstream. Zingaya came next, bringing click-to-call inside the browser. Voximplant followed, growing into a serverless platform developers rely on for real-time voice and video. Alexey writes about the practical side of Voice AI, especially where large language models collide with the messy realities of global telephony.
You started your career as a VoIP engineer in the mid-2000s, long before AI entered real-time communications. What were the biggest gaps you saw back then that eventually pushed you toward founding Voximplant?
I’ve been involved with VoIP systems since 2005. Back then, building reliable communications was slow and complex. I noticed that many developers shared my frustration – teams were trying to wire together telecom components instead of focusing on the product experience they actually wanted to deliver. This pushed me toward the idea of programmable communications for developers. We wanted to create a platform that would let anyone build communication products without needing to be a telecom expert.
Before Voximplant, I co-founded SIP-based calling services Flashphone and Zingaya, which offered early click-to-call products. The demand proved once again that teams wanted programmable communication, but the tooling wasn’t there yet. All of that led to the creation of Voximplant in 2013.
Today, we are seeing a similar gap, but on a bigger scale. Voice AI is entering production flows, LLMs continue to evolve every month, but the global phone network remains fragmented. No single vendor can solve everything end-to-end. That is why Voximplant acts as an orchestration layer, offering developers a fast and cost-effective way to experiment with the latest and most advanced tools and to deploy Voice Agents on real calls, without worrying about telephony infrastructure or streaming complexity.
Voximplant positions itself as an orchestration layer rather than a single AI or telephony provider. Why did you believe orchestration was the right abstraction layer to build for the future of voice AI?
It was important for us from the beginning to be global, and you can’t provide a global telephony platform without doing some telephony orchestration. Technical requirements and infrastructure vary by country, and we offer phone numbers in more than 190 countries, so this means we do a lot of technical mediation.
In addition, telephony standards like SIP have evolved into many flavors across vendors. Connecting different telcos and various customer communications infrastructures requires flexible systems that can adapt quickly. Newer phone networks, like WhatsApp, for example, continue to drive needs here – and this is all before adding the communications control layer logic on top that actually executes our customers’ unique application logic.
On the AI side, the market is very intense and evolving rapidly. The “best” vendor today is likely to be second or third place next week. Our approach is to support as many of the leading providers as possible. We want our customers to always have a full set of state-of-the-art options to choose from. They can choose the right AI providers for their given application – or even mix and match. Our orchestration platform also aims to make switching between providers simpler – while still exposing their full capabilities so developers don’t get stuck with a lowest common denominator feature set.
Many teams underestimate how difficult it is for a voice AI agent to place and manage real phone calls. From your perspective, what makes real-world telephony so challenging compared to purely digital AI interactions?
The phone network is still highly fragmented and inconsistent across regions, making it even more unpredictable. In some countries, certain protocols may be restricted or blocked, carriers experience outages as part of normal operations, and call routing patterns can shift throughout the day. There are also regions where cloud telephony can be legally complicated.
We’ve also seen cases where the infrastructure itself becomes the bottleneck. For example, an Australian healthcare startup building an AI caller to check in on elderly Cantonese-speaking patients struggled with high latency to US-based Voice AI providers (like OpenAI or ElevenLabs), and the limited availability of high-quality Cantonese TTS made the conversations feel slow and unnatural.
On top of reliability, there’s the compliance layer. Requirements vary widely from country to country and often overlap with frameworks such as HIPAA, PCI DSS, and GDPR.
Speech performance itself isn’t universal either. No single STT or TTS engine works best in every environment. Accents, background noise, call quality fluctuations, or even provider degradation can cause sudden drops in accuracy and user experience.
Some Voice AI systems today rely on multiple vendors for LLMs, speech-to-text, text-to-speech, and routing. Why is this fragmentation inevitable, and why should swapping AI or speech providers be a quick code change rather than a major engineering project?
Early on in Voice AI, there was no true speech-to-speech option, so you had to cobble together speech-to-text, an LLM, and text-to-speech. Today, several LLM vendors integrate speech directly (often with some level of barge-in support), removing the need to build a full pipeline. These systems are faster and highly interactive, but they still have limitations around aspects like function calling, and they offer fewer options for improving transcription and voices. We expect speech-based LLMs to be comparable to text models soon. Even then, customers may still want to use different speech vendors for their specific requirements, and keeping some pipeline separation also adds options for redundancy.
Swapping AI and speech vendors on our platform isn’t a major engineering effort, but it’s more than a one-line code change, too. Speech vendors are constantly fighting against commoditization by introducing unique features. We keep our connectors as consistent as possible while still exposing each provider’s capabilities, so even when taking advantage of these unique features, swapping providers often means changing only a few lines of code.
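The "few lines of code" swap described above can be sketched roughly as follows. This is an illustrative example, not a real Voximplant API: the `PipelineConfig` class, provider names, and `vendor_options` field are all hypothetical, showing how a consistent connector interface can coexist with vendor-specific feature pass-through.

```python
from dataclasses import dataclass, field

@dataclass
class PipelineConfig:
    """Hypothetical provider-agnostic voice pipeline configuration."""
    stt: str   # speech-to-text provider
    llm: str   # language model provider
    tts: str   # text-to-speech provider
    # Vendor-specific options live in a separate dict, so shared
    # application code never hard-codes one vendor's feature set.
    vendor_options: dict = field(default_factory=dict)

config = PipelineConfig(
    stt="stt_provider_a",   # swapping vendors = changing this line
    llm="llm_provider_b",
    tts="tts_provider_c",
    vendor_options={"barge_in": True},  # unique feature, passed through
)
```

The design point is that the swap touches only the configuration, while vendor-unique capabilities remain reachable instead of being flattened to a lowest common denominator.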
How are voice AI agents beginning to change the economics of customer support, sales, and other B2C operations compared to traditional call center models?
It may be too early to talk about a significant shift in the economics of customer support, but it’s definitely coming. Today, there are regions where customer support representatives cost less than LLM-powered services, yet this model comes with well-known challenges around scalability, burnout, management, and operations. I expect the economics to change significantly as LLM optimization continues to improve, although it will still take some time.
What signals tell you that Voice AI is moving from experimentation into mission-critical infrastructure for enterprises?
The strongest signal here is investment in Voice AI infrastructure, which is growing rapidly. Voice AI-enabled calls and minutes can be estimated at a global scale, even if they can’t be tracked exactly. I can only track this directly for Voximplant, and there we clearly see strong growth.
How do you think developer expectations around flexibility and control have changed as AI models and voice technologies iterate faster?
That’s an interesting question. When it comes to speed of change, AI is unmatched by anything we’ve seen in history. Control and flexibility are less straightforward, depending on what we mean by those terms. When it comes to control, there are many well-known challenges, and overcoming them isn’t easy. Most AI companies spend significant efforts on model guardrails, but doing this well requires deep expertise, and different companies clearly have different goals.
What mistakes do companies most commonly make when trying to deploy voice AI agents directly on top of traditional telephony systems?
Traditional telephony systems aren’t directly compatible with Voice AI services, so they typically require additional integration, usually via the SIP protocol or WebSockets. Common mistakes include insufficient failover management, latency issues (which can have many causes), and scalability challenges.
Telephony itself scales quite well, especially with VoIP. Voice AI services are harder to scale because of the hardware requirements needed to run LLMs, and even rather big infrastructure players like Amazon can face capacity constraints when it comes to inference hardware.
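One integration detail the answer alludes to is bridging a telephony media stream to a Voice AI service over WebSockets, which usually means re-framing audio into fixed-size chunks. A minimal sketch, assuming 8 kHz 16-bit mono PCM and 20 ms frames (a common telephony framing); the `send` callback stands in for a real WebSocket client:

```python
SAMPLE_RATE = 8000      # Hz, typical narrowband telephony
BYTES_PER_SAMPLE = 2    # 16-bit linear PCM
FRAME_MS = 20           # common telephony frame duration

# 8000 samples/s * 2 bytes * 0.02 s = 320 bytes per frame
FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000

def stream_frames(pcm: bytes, send) -> int:
    """Split raw PCM into fixed 20 ms frames, pass each to send().

    Returns the number of complete frames sent; a trailing partial
    frame is held back (a real bridge would buffer it for later).
    """
    count = 0
    for i in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        send(pcm[i:i + FRAME_BYTES])
        count += 1
    return count
```

Getting this framing, buffering, and backpressure handling wrong is one concrete way the latency and failover mistakes mentioned above show up in practice.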
Looking ahead, what capabilities do you think voice AI platforms must support to remain relevant as real-time AI becomes more autonomous?
I think Voice AI platforms need to focus on SLAs, since reliability can still be an issue at times, and on additional tooling for testing and observability.
Eventually, the most advanced platforms will offer everything required, but today, we’re still learning new lessons every day, many of which should become a part of the core stack. If you work with large enterprises or in regulated environments, having an on-prem version of your product can be critical.
When you reflect on your journey from early VoIP infrastructure to leading a voice AI platform today, what has surprised you most about how the industry has evolved?
Many things have surprised me, but one of them is that changes in VoIP infrastructure take years to happen. A good example is that telephony still relies on narrow-band audio codecs (G.711, G.729), while people are already used to wideband audio in online communication services such as Zoom, Google Meet, WhatsApp, etc.
Most AI models are trained on wideband audio data as well. All modern mobile phones have wideband audio codecs built in, but there are still significant interoperability challenges at the carrier level that prevent wideband audio from being used in traditional phone calls. It’s not like there is no progress at all, but in my opinion, it’s been very modest.
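The mismatch described above is concrete: narrowband codecs like G.711 sample at 8 kHz, while AI models typically expect 16 kHz or wider audio. A minimal sketch of the naive bridge, upsampling 8 kHz PCM to 16 kHz by linear interpolation; note this only changes the sample rate and cannot restore the frequencies above roughly 4 kHz that the narrowband codec already discarded, which is exactly why carrier-level wideband support matters:

```python
def upsample_2x(samples: list[int]) -> list[int]:
    """Double the sample rate of 16-bit PCM samples (e.g. 8 kHz -> 16 kHz)
    by inserting the midpoint between each pair of neighboring samples."""
    if not samples:
        return []
    out = []
    for a, b in zip(samples, samples[1:]):
        out.append(a)
        out.append((a + b) // 2)  # linear interpolation between neighbors
    out.append(samples[-1])
    out.append(samples[-1])       # repeat the final sample to pad the pair
    return out
```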