
Johan Wadenholt Vrethem, CEO at Voxo – Interview Series

Johan Wadenholt Vrethem brings over two decades of experience at the intersection of technology and business, with a focus on leveraging AI to transform how organizations operate and engage with their customers. He led major digital initiatives and client programs in the banking and finance sectors at CGI, before co-founding Voxo to drive innovation in conversational analytics and event technology.

At Voxo AI, Johan is spearheading the delivery of real-time, AI-powered intelligence from live discussions at events and conferences, empowering teams to move from data to action with speed and precision. Committed to both commercial impact and social good, he has also led CSR initiatives aimed at preventing online child exploitation.

Voxo AI is an event intelligence platform that uses artificial intelligence to capture and transform live spoken conversations from conferences, panels, and sessions into structured, usable insights. By analyzing real-time audio, it generates instant summaries, key takeaways, and post-event content such as reports and branded assets, allowing organizers, attendees, sponsors, and speakers to extract lasting value from discussions without manual note-taking or follow-up work.

Before founding Voxo, you spent years leading complex digital and AI-driven initiatives in banking and financial services at CGI. What specific frustrations or gaps from that experience convinced you it was time to build your own company focused on conversational intelligence?

My time at CGI was incredibly formative. It’s a large organization with hundreds of IP assets in addition to consulting, and I got a front-row seat to complex delivery environments, governance, and enterprise transformation at scale. It was also fragmented, spread across many technologies, stakeholders, and competing priorities.

I moved from Business Analyst to Director in just two years, and at that point I felt ready to focus. When I met my co-founders, it clicked that we could build something sharper: a single track that used the best technology available to solve a very specific, high-value problem. What many people don’t know is that we started as a fintech company focused on documentation in financial advisory. From there we evolved into conversational analytics, and ultimately expanded into event intelligence after nearly a decade of learning how to extract real meaning from human conversation.

Early on, what were the hardest technical or commercial challenges in building AI that could reliably understand real conversations rather than controlled, scripted inputs?

In our earliest fintech products, the technology limited the ambition. Automatic speech recognition for Nordic languages, which was our initial focus, had word error rates in the 70 to 80 percent range. At that level, you simply can’t build a product that replaces human documentation.
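For readers unfamiliar with the metric, word error rate (WER) is the standard measure of transcription quality: the number of word substitutions, deletions, and insertions needed to turn the system’s output into the correct transcript, divided by the number of words in that correct transcript. The short Python sketch below shows how it is typically computed; the example sentences are illustrative and not from Voxo’s data.

```python
# Minimal sketch of word error rate (WER), the metric referenced above.
# WER = (substitutions + deletions + insertions) / words in the reference,
# computed here with a standard word-level Levenshtein distance.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the quick brown fox", "the kwik brown fox jumps"))  # 0.5
```

At a WER of 0.7 to 0.8, roughly seven or eight edits are needed for every ten reference words, which is why fully automated documentation was out of reach at the time.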

At the same time, modern large-language-model capabilities didn’t exist yet, so producing reliable summaries was close to impossible. When we later launched our event service, the landscape had changed. We had built deep know-how over years, and we finally had the right AI building blocks to understand keynotes, debates, and roundtables in a way that could scale.

Voxo started with conversational analytics and later expanded into large-scale event intelligence. What signals told you that live events were the next major frontier for speech AI?

Interestingly, we first started working with events as a way to reach C-level executives faster and demonstrate how powerful conversational intelligence could be. But once we delivered at Sweden’s largest tech event, Techarenan, with more than 10,000 attendees, we saw a huge shift.

The inbound demand was immediate and very clear. People weren’t just impressed; they wanted to buy the event service as a product. That was the signal. We decided to invest the time, focus, and resources required to deliver it globally, and to do it at the highest possible quality level.

From a systems perspective, what fundamentally changes when you move from transcribing a single meeting to processing hundreds of concurrent sessions across a multi-day event?

Complexity compounds fast. You’re not only maintaining stability and quality across each individual session; you’re also dealing with real-world chaos. Last-minute schedule changes, speaker swaps, and program updates are normal at large events.

To deliver without putting extra load on already stretched event teams, you need processes that are rigorous and still flexible. You also need a proven methodology for analysis. You can’t just throw hundreds of hours of audio into a model and ask for an interesting report. To generate high-quality outputs in minutes, you have to combine multiple models, pipelines, and layers of structure.
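As a rough illustration of what combining multiple models, pipelines, and layers of structure can look like, here is a minimal sketch of a layered, per-session pipeline. The stage names and stubbed functions are assumptions for illustration, not a description of Voxo’s actual stack.

```python
# Illustrative sketch of a layered event-intelligence pipeline (not Voxo's
# actual architecture). Each session flows through independent stages so
# hundreds of concurrent sessions can be processed and re-run in isolation.

from dataclasses import dataclass

@dataclass
class Session:
    session_id: str
    title: str
    audio_path: str  # hypothetical path to the session recording

def transcribe(session: Session) -> str:
    """Stage 1: an ASR model turns audio into a raw transcript (stubbed)."""
    return f"raw transcript of {session.audio_path}"

def segment(transcript: str) -> list[str]:
    """Stage 2: split the transcript into topical chunks that fit a model's
    context window; a real system might use diarization and topic shifts."""
    return [transcript]  # stub: single chunk

def summarize_chunk(chunk: str) -> str:
    """Stage 3: an LLM summarizes each chunk against a fixed template,
    keeping outputs consistent across hundreds of sessions."""
    return f"summary({chunk})"

def assemble_report(session: Session, chunk_summaries: list[str]) -> dict:
    """Stage 4: merge chunk summaries into one structured, reviewable
    artifact; quality assurance happens on this object before distribution."""
    return {"session": session.title, "takeaways": chunk_summaries}

def run_pipeline(session: Session) -> dict:
    chunks = segment(transcribe(session))
    return assemble_report(session, [summarize_chunk(c) for c in chunks])

report = run_pipeline(Session("s-001", "Opening keynote", "audio/s-001.wav"))
```

Keeping each stage isolated per session is one way to absorb the real-world chaos described above: a speaker swap or schedule change forces a re-run of only the affected session, not the whole event.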

Many AI tools emphasize automation above all else. Why did you decide to include human-in-the-loop review as a core part of Voxo’s platform?

Trust is still the biggest barrier, especially for enterprise customers like HubSpot, GitHub, and Intuit. The fear of publishing something inaccurate is very real. That’s why stable processes, plus a combination of AI review and human quality assurance, remain a requirement for many customers today.

We also give customers control. They can review and approve summaries before anything is distributed, and we make that workflow efficient. Over time, I believe the need for human review will decrease as the technology and safeguards mature. Until then, nothing matters more than earning the right to be trusted with content that represents their brand.
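As a rough sketch of the kind of review-and-approve gate described here (the states, fields, and method names are illustrative assumptions, not Voxo’s actual workflow):

```python
# Hypothetical sketch of a human-in-the-loop approval gate: an AI-generated
# summary sits in a pending state and can only be distributed after explicit
# human sign-off. States and fields are illustrative, not Voxo's schema.

from enum import Enum

class ReviewState(Enum):
    PENDING = "pending"      # generated, awaiting human review
    APPROVED = "approved"    # cleared for distribution
    REJECTED = "rejected"    # sent back for regeneration or manual edit

class SummaryDraft:
    def __init__(self, session_id: str, text: str):
        self.session_id = session_id
        self.text = text
        self.state = ReviewState.PENDING

    def approve(self) -> None:
        self.state = ReviewState.APPROVED

    def reject(self) -> None:
        self.state = ReviewState.REJECTED

def distribute(draft: SummaryDraft) -> None:
    # The hard gate: nothing leaves the platform without approval.
    if draft.state is not ReviewState.APPROVED:
        raise PermissionError(f"{draft.session_id} has not been approved")
    print(f"Publishing summary for {draft.session_id}")

draft = SummaryDraft("s-001", "AI-generated session recap...")
draft.approve()
distribute(draft)  # succeeds only after explicit approval
```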

How has near-real-time transcription and summarization changed the way event teams think about content timelines and post-event value?

It fundamentally resets the timeline. Instead of content being something you publish weeks later, it becomes something you can use while the event is still happening and immediately after each session ends.

What we see is that customers suddenly have material that keeps engagement alive for months. Attendees and speakers are also far more likely to share content right after a session, as long as it’s easy and it looks crisp. If that same content arrives a month later, it’s usually too late to drive meaningful distribution, especially on social media. Near-real-time turns content into an extension of the live experience, not just a post-event archive.

Events involve multiple stakeholders: organizers, speakers, sponsors, and attendees. How does Voxo design outputs that serve all of them without diluting insight or quality?

We design from the stakeholder outward, but we keep the same underlying source of truth. Everyone benefits from the same captured content, then we tailor outputs to match the stakeholder’s goals.

Attendees get instant, shareable session recaps and the ability to revisit sessions they missed. Marketing teams get sponsor-branded assets that are built for distribution and measurable impact. Organizers get higher attendee value, longer event momentum, and new revenue options. Speakers get a one-click way to share a polished summary, and organizers benefit from that network effect.

The key is that we don’t dilute quality. We build one robust content engine, then package it differently for each stakeholder based on what creates real value.

Events using Voxo report faster content delivery and higher sponsor engagement. What do you think matters most in achieving that impact: speed, structure, or insight quality?

It’s the combination. Speed doesn’t matter if the content lacks structure and quality. At the same time, even the best content becomes less valuable if it arrives too late.

The real advantage is delivering all three together: high-quality insights, packaged in a clear structure, delivered fast enough to still feel relevant. That’s what makes content useful, shareable, and commercially impactful.

What does “real-time” truly mean for AI-driven content platforms over the next few years, and how close are we to that reality today?

In some cases, true real-time is already here. We’ve delivered real-time commentary across multiple live streams, for example with NHS in Manchester last summer together with First Sight Media and Lineup Ninja. We also introduced real-time experiences as early as 2023 at Techarenan with speakers like Al Gore and Steve Wozniak.

That said, there’s room for both near-real-time and true real-time at events. The important part is being intentional about what creates value. A real-time word cloud updating behind a speaker may be more distracting than helpful. Real-time should enhance the attendee experience, not compete with it.

Finally, what is one common misconception about AI-generated summaries or transcripts that you regularly have to correct when speaking with enterprise customers?

The biggest misconception is that you can get reliable, consistent, high-quality summaries by simply transcribing an audio file and pasting it into ChatGPT. People also realize quickly that it’s time-consuming and hard to keep consistent, especially when you have a large number of sessions. And even then, transcription and summarization are only a small part of what we deliver, maybe 5 percent. The real work is the speed, the structure, the context, the brand-ready packaging, the quality assurance, and the distribution formats that make the content usable and valuable at enterprise scale.

Thank you for the great interview. Readers who wish to learn more should visit Voxo AI.

Antoine is a visionary leader and founding partner of Unite.AI, driven by an unwavering passion for shaping and promoting the future of AI and robotics. A serial entrepreneur, he believes that AI will be as disruptive to society as electricity, and is often caught raving about the potential of disruptive technologies and AGI.

As a futurist, he is dedicated to exploring how these innovations will shape our world. In addition, he is the founder of Securities.io, a platform focused on investing in cutting-edge technologies that are redefining the future and reshaping entire sectors.