LXT Chief Growth Officer Phil Hall is a former Appen executive and Forbes Technology Council member. In his leadership role at Appen, he ran a division of 1,000+ staff and played a key role in achieving 17 consecutive years of revenue growth with consistently strong profitability. In his current role with LXT, he is working with a hand-picked team of experts to achieve ambitious growth goals.
LXT is an emerging leader in AI training data that powers intelligent technology for global organizations, including the largest technology companies in the world. In partnership with an international network of contributors, LXT collects and annotates data across multiple modalities with the speed, scale and agility required by the enterprise. Its global expertise spans more than 115 countries and over 750 language locales. Founded in 2010, LXT is headquartered in Toronto, Canada, with a presence in the United States, Australia, Egypt, the UK and Turkey. The company serves customers in North America, Europe, Asia Pacific and the Middle East.
When did you initially discover that you were passionate about language?
I’ve been intrigued by language for as long as I can remember, but in terms of my direct engagement with language and linguistics, there was a single significant turning point for me. We realized very early on that one of our children was dyslexic, and when we spoke to her school about additional support they said that while there were programs they could access, there were also things I could do as a volunteer at the school to help our daughter and other children. It went well, and from there I went on to study linguistics and found myself teaching at two of the universities here in Sydney.
You were teaching linguistics before you moved into the speech data space. What inspired you to shift your focus?
Sydney-based Appen was just making the transition from being an operation run out of a spare room in a home to being a fully fledged commercial operation. I was told they were looking for linguists (perhaps more accurately, a linguist!) and I was introduced to the founders Julie and Chris Vonwiller. The transition was gradual and stretched over about two years. I was reluctant to walk away from teaching – working with high-achieving students was both inspiring and a lot of fun. But especially during those pioneering years I was solving difficult problems alongside the world’s leading language technology experts, and the excitement levels were high. A lot of what is taken for granted today was very challenging at that time.
You came out of retirement to join LXT. What motivated you to do this?
That is an interesting question, as I was definitely enjoying myself in retirement. In fact, our co-founder and CEO Mohammad Omar first approached me months before I responded to his inquiry; I was living a relaxed lifestyle and hadn’t really contemplated returning to full-time work. When I finally agreed to take a first call with Mo about the possibility of joining LXT, I expected to listen politely and decline.
But in the end, the opportunity was simply too good to resist.
While speaking with Mohammad and the other members of the LXT team, I immediately recognized a shared passion for language. The team that Mohammad had assembled was stocked with creative thinkers with boundless energy who were fully committed to the company’s mission.
As I learned more about the opportunity with LXT, I realized it was one that I didn’t want to pass up. Here was a company with massive potential to expand and grow in an area I’m passionate about. And as the market for AI continues to grow exponentially, the opportunity to help more organizations move from experimentation to production is an exciting one that I am very happy to be a part of.
What are some of the current challenges behind acquiring data at scale?
The challenges are as varied as the applications driving them.
From a practical perspective, challenges include authenticity, reliability, accuracy, security and ensuring that the data is fit for purpose – and that is before taking into account the growing number of legal and ethical challenges inherent in data acquisition.
For example, the development of technology in support of autonomous vehicles requires collection of extremely large volumes of data across a multitude of scenarios so that the car will understand how to respond to real world situations. There are endless numbers of edge cases that one can encounter when driving, so the algorithms that power those vehicles need datasets that cover everything from streets to stop signs to falling objects. And then if you multiply that by the number of weather events that can occur, the amount of training data needed increases exponentially. Automotive companies venturing into the autonomous space need to establish a reliable data pipeline, and doing that on their own would take a massive amount of resources.
Another use case is the expansion of an existing voice AI product into new markets to capture market share and new customers. This inevitably requires language data, and to achieve accuracy it’s critical to source speech data from native speakers across a variety of demographic profiles. Once the data has been collected, the speech files need to be transcribed to train the product’s NLP algorithms. Doing this for multiple languages and at the data volumes that are needed to be effective is extremely challenging for companies to do on their own, particularly if they lack the internal expertise in this field.
These are just two examples of the many challenges that exist with data collection for AI at scale, but as you can imagine, home automation, mobile device and biometric data collections each also have their specific challenges.
What are the current ways that LXT sources and annotates data?
At LXT, we collect and annotate data differently for each customer, as all of our engagements are tailored to meet our clients’ specifications. We work across a variety of data types, including audio, image, speech, text and video. For data collections, we work with a global network of contractors to collect data in these different modalities. Collections can range from acquiring data in real-world settings such as homes, offices or in-car, to in-studio with experienced engineers in the case of certain speech data collection projects.
Our data annotation capabilities also span multiple modalities. Our experience began in the speech space and over the past 12 years we’ve expanded into over 115 countries and more than 750 language locales. This means that companies of all sizes can depend on LXT to help them penetrate a wide range of markets and capture new customer segments. More recently we’ve expanded into text, image and video data, and our internal platform is used to deliver high-quality data to our customers.
Another exciting area of growth for us has been with our secure annotation work. Just this year we expanded our ISO 27001 secure facility footprint from two to five locations worldwide. We’ve now developed a playbook that enables us to establish new facilities in a matter of months. The services we focus on in these secure facilities are currently speech data annotation and transcription, but they can be used for annotation across many data types.
Why is sourcing data this way a superior alternative to synthetic data?
Synthetic data is an exciting development in the field of AI and is well suited to specific use cases, particularly edge cases that are hard to capture in the real world. The use of synthetic data is on the rise, particularly in the early stages of AI maturity as companies are still in experimentation mode. However, our own research shows that as organizations mature their AI strategies and push more models into production they are much more likely to use supervised or semi-supervised machine learning methods that rely on human-annotated data.
Humans are simply better than computers at understanding nuance, which makes them essential for creating the data needed to train ML models to perform with high accuracy; human oversight is also critical to reducing bias.
Why is this data so important to speech and Natural Language Processing?
For speech and natural language processing algorithms to work effectively in their intended markets, they need to be trained with high volumes of data sourced from native speakers who have the cultural context of the end users they represent. Without this data, voice AI adoption will have severe limitations.
In addition, the environment needs to be accounted for when collecting speech data. If the voice AI solution being trained will be used in a car, for example, there are different road and weather conditions that affect speech and need to be taken into account. These are complex scenarios where an experienced data partner can help.
Is there anything else that you would like to share about LXT?
First, I want to thank you for the opportunity to share our story! I’d like to highlight that our company is committed to helping organizations of all sizes succeed with their AI initiatives. We’ve been focused on delivering highly customized AI data to companies around the world for over 12 years, and we’d be happy to connect with anyone looking to create a reliable data pipeline to support their AI projects.
Thank you for the great interview. Readers who wish to learn more should visit LXT.