Dylan Fox is the CEO & Founder of AssemblyAI, a platform that converts audio and video files and live audio streams to text through its Speech-to-Text APIs.
What initially attracted you to machine learning?
I started out by learning how to program and attended Python Meetups in Washington DC, where I went to college. Through college courses, I found myself leaning more into algorithmic programming problems, which naturally led me to machine learning and NLP.
Prior to founding AssemblyAI, you were a Senior Software Engineer at Cisco. What were you working on?
At Cisco, I was a Senior Software Engineer focusing on Machine Learning for their collaboration products.
How did your work at Cisco and a problem with sourcing speech recognition technology inspire you to launch AssemblyAI?
In some of my prior jobs, I had the opportunity to work on a lot of AI projects, including several that required speech recognition. But all of the companies offering speech recognition as a service were insanely antiquated, hard to buy from, and running outdated AI tech.
As I became more and more interested in AI research, I noticed how much work was being done in the field of speech recognition and how quickly the research was improving. So it was a combination of factors that inspired me to think, “What if you could build a Twilio-style API company on the latest AI research, one that made it much easier for developers to access state-of-the-art AI models for speech recognition, with a much better developer experience?”
And it was from there that the idea for AssemblyAI grew.
What is the biggest challenge behind building accurate and reliable speech recognition technology?
Cost and talent are the biggest challenges for any company to tackle when building accurate and reliable speech recognition technology.
The data is expensive to acquire, and you typically need hundreds of thousands of hours of it to build a robust speech recognition system. On top of that, the compute requirements for training are enormous. Serving these models in production is also costly, and it takes specialized talent to optimize them and make them economical.
Building these technologies also requires a specialized skillset which is hard to find. That’s a big reason why customers come to us for powerful AI models that we research, train, and deploy in-house. They get access to years of research into state-of-the-art AI models for ASR and NLP, all with a simple API.
Outside of purely transcribing audio and video content, AssemblyAI offers additional models. Can you discuss what these models are?
Our suite of AI models extends beyond just real-time and asynchronous transcription. We refer to these additional models as Audio Intelligence models as they help customers analyze and better understand audio data.
Our Summarization model provides an overall summary, as well as time-coded summaries that automatically segment and generate a summary for each “chapter” as the topics in a conversation change (similar to YouTube chapters).
Our Sentiment Analysis model detects the sentiment of each sentence of speech spoken in audio files. Each sentence in a transcript can be marked as Positive, Negative, or Neutral.
Our Entity Detection model identifies a wide range of entities that are spoken in audio files, such as person or company names, email addresses, dates, and locations.
Our Topic Detection model labels the topics that are spoken in audio and video files. The predicted topic labels follow the standardized IAB Taxonomy, which makes them suitable for contextual targeting.
Our Content Moderation model detects sensitive content in audio and video files — such as hate speech, violence, sensitive social issues, alcohol, drugs, and more.
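As a rough illustration of how features like these are typically enabled together on a single transcription request, the sketch below builds a JSON payload for a hypothetical API call. The parameter names (`auto_chapters`, `sentiment_analysis`, `entity_detection`, `iab_categories`, `content_safety`) and the example URL are illustrative assumptions based on common API conventions, not details taken from this interview.

```python
import json

# Hypothetical request payload enabling Audio Intelligence models alongside
# core transcription. All flag names below are illustrative assumptions.
def build_transcript_request(audio_url: str) -> dict:
    return {
        "audio_url": audio_url,         # publicly reachable audio/video file
        "auto_chapters": True,          # time-coded "chapter" summaries
        "sentiment_analysis": True,     # Positive/Negative/Neutral per sentence
        "entity_detection": True,       # names, emails, dates, locations
        "iab_categories": True,         # topic labels from the IAB Taxonomy
        "content_safety": True,         # hate speech, violence, drugs, etc.
    }

payload = build_transcript_request("https://example.com/meeting.mp3")
print(json.dumps(payload, indent=2))
```

The appeal of this style of API is that each additional model is just another flag on the same request, with all results returned alongside the transcript.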
What are some of the biggest use cases for companies using AssemblyAI?
The biggest use cases companies have for AssemblyAI span across four categories: telephony, video, virtual meetings, and media.
CallRail is a great example of a customer in the Telephony space; it leverages AssemblyAI's Core Transcription, Automatic Transcript Highlights, and PII Redaction models to deliver a powerful Conversational Intelligence solution to its customers.
Essentially, CallRail can now automatically surface key content in its customers' phone calls at scale: specific customer requests, commonly asked questions, and frequently used keywords and phrases. Our PII Redaction model also helps them automatically detect and remove sensitive data found in transcript text (e.g. social security numbers, credit card numbers, personal addresses, and more).
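To make the redaction idea concrete, here is a toy sketch that masks two common PII patterns in transcript text with regular expressions. This is a deliberately simplified stand-in for illustration only; a model-based redactor like the one described above handles far more entity types and does not rely on fixed patterns.

```python
import re

# Toy PII redaction for transcript text. Patterns below are simplistic
# illustrations, covering only US-style SSNs and 13-16 digit card numbers.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}

def redact_pii(text: str) -> str:
    # Replace each matched pattern with a bracketed label, e.g. [SSN].
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact_pii("My SSN is 123-45-6789 and card 4111 1111 1111 1111."))
# prints: My SSN is [SSN] and card [CREDIT_CARD].
```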
Video use cases range from video streaming platforms to video editors like Veed, who use AssemblyAI's Core Transcription models to simplify the video editing process for users. Veed lets its users transcribe their videos and edit them directly using the captions.
In Virtual Meetings, meeting transcription software companies like Fathom are using AssemblyAI to build intelligent features that help their users transcribe and highlight the key moments from their Zoom calls, fostering better meeting engagement and eliminating tedious tasks during and after meetings (e.g. taking notes).
In Media, we see podcast hosting platforms, for example, use our Content Moderation and Topic Detection models to offer better ad tools for brand safety use cases and to monetize user-generated content with dynamic ads.
AssemblyAI recently raised a $30M Series B round. How will this accelerate the AssemblyAI mission?
The progress being made in the field of AI is incredibly exciting. Our goal is to expose this progress to every developer and product team on the internet via a simple set of APIs. As we continue to research and train state-of-the-art AI models for ASR and NLP tasks (speech recognition, summarization, language identification, and many others), we will keep exposing these models to developers and product teams through simple APIs, available for free.
AssemblyAI is a place where developers and product teams can get easy access to the advanced AI models they need to build exciting new products, services, and entire companies.
Over the past six months, we've launched ASR support for 15 new languages, including Spanish, German, French, Italian, Hindi, and Japanese; released major improvements to our Summarization, Real-Time ASR, and Content Moderation models; and shipped countless other product updates.
We’ve barely dipped into our Series A funds, but this new funding will give us the ability to aggressively scale up our efforts — without compromising on our runway.
With this new funding, we'll be able to accelerate our product roadmap, build out better AI infrastructure to speed up our AI research and inference engines, and grow our AI research team, which today includes researchers from DeepMind, Google Brain, Meta AI, BMW, and Cisco.
Is there anything else that you would like to share about AssemblyAI?
Our mission is to make state-of-the-art AI models accessible to developers and product teams at extremely large scale through a simple API.
Thank you for the great interview. Readers who wish to learn more should visit AssemblyAI.