Anastassia Loukina is a research scientist at Educational Testing Service (ETS), where she works on automated scoring of speech.
Her research interests span a wide range of topics. She has worked, among other things, on Modern Greek dialects, speech rhythm, and automated prosody analysis.
Her current work focuses on combining tools and methods from speech technologies and machine learning with insights from studies on speech perception and production in order to build automated scoring models for evaluating non-native speech.
You clearly have a love of languages, what introduced you to this passion?
I grew up speaking Russian in St. Petersburg, Russia, and I remember being fascinated when I was first introduced to the English language: for some words, there was a pattern that made it possible to “convert” a Russian word to an English word. And then I would come across a word where “my” pattern failed and try to come up with a better, more general rule. At that time, of course, I knew nothing about linguistic typology or the difference between cognates and loan words, but this fueled my curiosity and desire to learn more languages. This passion for identifying patterns in how people speak and testing them on data is what led me to phonetics, machine learning, and the work I am doing now.
Prior to your current work in Natural Language Processing (NLP), you were a translator working between English and Russian and between Modern Greek and Russian. Do you believe that your work as a translator has given you additional insights into some of the nuances and problems associated with NLP?
My primary identity has always been that of a researcher. It’s true that I started my academic career as a scholar of Modern Greek, or more specifically, Modern Greek phonetics. For my doctoral work, I explored phonetic differences between several Modern Greek dialects and how the differences between these dialects could be linked to the history of the area. I argued that some of the differences between the dialects could have emerged as a result of contact between each dialect and other languages spoken in the area. While I no longer work on Modern Greek, the changes that happen when two languages come into contact with each other are still at the heart of my work: only this time I focus on what happens when an individual is learning a new language and how technology can help them do so most efficiently.
When it comes to the English language, there are a myriad of accents. How do you design an NLP system with the capability to understand all of the different dialects? Is it simply a matter of feeding the deep learning algorithm additional big data for each type of accent?
There are several approaches that have been used in the past to address this. In addition to building one large model that covers all accents, you could first identify the accent and then use a custom model for that accent, or you can try multiple models at once and pick the one which works best. Ultimately, to achieve good performance on a wide range of accents you need training and evaluation data representative of the many accents that a system may encounter.
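The “try multiple models at once and pick the best” strategy mentioned above can be sketched in a few lines. This is an illustrative toy, not how any particular ETS system works: the accent-specific models here are hypothetical stubs that return a transcript and a confidence score, standing in for real ASR decoders.

```python
def recognize_with_best_model(audio, models):
    """Run every accent-specific model on the audio and keep the
    hypothesis with the highest confidence score."""
    best = None
    for accent, model in models.items():
        transcript, confidence = model(audio)
        if best is None or confidence > best[2]:
            best = (accent, transcript, confidence)
    return best

# Hypothetical stub models: each returns (transcript, confidence).
models = {
    "en-US": lambda audio: ("hello world", 0.82),
    "en-GB": lambda audio: ("hello world", 0.91),
    "en-IN": lambda audio: ("hello word", 0.64),
}

accent, transcript, confidence = recognize_with_best_model(b"...", models)
print(accent, transcript, confidence)
```

A real system would also need a sensible fallback when all models are equally unconfident, which is one reason a single large model trained on representative multi-accent data is often preferred.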
At ETS we conduct comprehensive evaluations to make sure that the scores produced by our automated systems reflect differences in the actual skills we want to measure and are not influenced by the demographic characteristics of the learner such as their gender, race, or country of origin.
Children and/or language learners often have difficulty with perfect pronunciation. How do you overcome the pronunciation problem?
There is no such thing as perfect pronunciation: the way we speak is closely linked to our identity and as developers and researchers our goal is to make sure that our systems are fair to all users.
Both language learners and children present particular challenges for speech-based systems. For example, children’s voices not only have very different acoustic qualities, but children also speak differently from adults, and there is a lot of variability between children. As a result, developing automated speech recognition for children is usually a separate task that requires a large amount of child speech data.
Similarly, even though there are many similarities between language learners from the same background, learners can vary widely in their use of phonetic, grammatical, and lexical patterns, making speech recognition a particularly challenging task. When building our systems for scoring English language proficiency, we use data from language learners with a wide range of proficiencies and native languages.
In January 2018, you published ‘Using exemplar responses for training and evaluating automated speech scoring systems’. What are some of the main findings that should be understood from this paper?
In this paper, we looked at how the quality of training and testing data affects the performance of automated scoring systems.
Automated scoring systems, like many other automated systems, are trained on data that has been labeled by humans. In this case, these are scores assigned by human raters. Human raters do not always agree on the scores they assign. There are several different strategies used in assessment to ensure that the final score reported to the test-taker remains highly reliable despite variation in human agreement at the level of the individual question. However, since automated scoring engines are usually trained using response-level scores, any inconsistencies in those scores due to rater disagreement may negatively affect the system.
We had access to a large amount of data with different levels of agreement between human raters, which allowed us to compare system performance under different conditions. What we found is that training the system on perfectly labeled data doesn’t actually improve its performance over a system trained on data with noisier labels. Perfect labels only give you an advantage when the training set is very small. On the other hand, the quality of human labels had a huge effect on system evaluation: your performance estimates can be up to 30% higher if you evaluate on clean labels.
The takeaway message is that if you have a lot of data and limited resources to clean your gold-standard labels, it might be smarter to clean the labels in the evaluation set rather than in the training set. And this finding applies not just to automated scoring but to many other areas as well.
Could you describe some of your work at ETS?
I work on speech scoring systems that process spoken language in an educational context. One such system is SpeechRater®, which uses advanced speech recognition and analysis technology to assess and provide detailed feedback about English language speaking proficiency. SpeechRater is a very mature application that has been around for more than 10 years. I build scoring models for different applications and work with other colleagues across ETS to ensure that our scores are reliable, fair, and valid for all test takers. We also work with other groups at ETS to continuously monitor system performance.
In addition to maintaining and improving our operational systems, we prototype new systems. One of the projects I am very excited about is RelayReader™: an application designed to help developing readers gain fluency and confidence. When reading with RelayReader, a user takes turns listening to and reading aloud a book. Their reading is then sent to our servers to provide feedback. In terms of speech processing, the main challenge of this application is how to measure learning and provide actionable and reliable feedback unobtrusively, without interfering with the reader’s engagement with the book.
What’s your favorite part of working with ETS?
What initially attracted me to ETS is that it is a non-profit organization with a mission to advance the quality of education for all people around the world. While of course it is great when research leads to a product, I appreciate having an opportunity to work on projects that are more foundational in nature but will help with product development in the future. I also cherish the fact that ETS takes issues such as data privacy and fairness very seriously and all our systems undergo very stringent assessment before being deployed operationally.
But what truly makes ETS a great place to work is its people. We have an amazing community of scientists, engineers and developers from many different backgrounds which allows for a lot of interesting collaborations.
Do you believe that an AI will ever be able to pass the Turing Test?
Since the 1950s, there have been many interpretations of how the Turing test should be conducted in practice. There is probably general agreement that the Turing test hasn’t been passed in the philosophical sense: there is no AI system that thinks like a human. However, this has also become a very niche subject. Most people don’t build their systems to pass the Turing test; we want them to achieve specific goals.
For some of these tasks, for example, speech recognition or natural language understanding, human performance may rightly be considered the gold standard. But there are also many other tasks where we would expect an automated system to do much better than humans, or where an automated system and a human expert need to work together to achieve the best result. For example, in an educational context we don’t want an AI system to replace a teacher: we want it to help teachers, whether through identifying patterns in student learning trajectories, helping with grading, or finding the best teaching materials.
Is there anything else that you would like to share about ETS or NLP?
Many people know ETS for its assessments and automated scoring systems. But we do much more than that. We have many capabilities, from voice biometrics to spoken dialogue applications, and we are always looking for new ways to integrate technology into learning. Now that many students are learning from home, we have opened several of our research capabilities to the general public.
Thank you for the interview and for offering this insight on the latest advances in NLP and speech recognition. Anyone who wishes to learn more can visit Educational Testing Service.