As impressive and useful as virtual assistants like Siri, Alexa, and Google Assistant are, their conversational skills are typically limited to receiving certain commands and delivering pre-defined responses. Companies like Google and Amazon have been pursuing methods of AI training and development that can make AI chatbots more robust and flexible, able to carry on conversations with users in a much more natural way. As reported by DigitalTrends, Google has recently published a paper demonstrating the capabilities of its new chatbot, dubbed “Meena”. According to a blog post from the researchers, Meena can engage in conversation with its users on just about any topic.
Meena is an open-domain chatbot, meaning that it responds to the context of the conversation so far and adapts to inputs in order to deliver more natural responses. Most other chatbots are closed-domain, which means that their responses are themed around certain ideas and limited to accomplishing specific tasks.
According to Google’s report, Meena’s flexibility was the result of a massive training dataset. Meena was trained on around 40 billion words pulled from social media conversations and filtered for the most relevant and representative words. Google aimed to deal with some of the problems that are found in most voice assistants, such as an inability to handle topics and commands that unfold over multiple turns in the conversation, with the user providing additional inputs after the bot has responded to one input. This means that many chatbots are unable to prompt the user for clarification, and when there is a query that can’t be interpreted they often just default to web results.
In order to deal with this particular problem, Google’s researchers enabled the model to keep track of the context of the conversation, meaning that it can generate specific answers. The model used an encoder that processes what has already been said in the conversation and a decoder that creates a response based on the context. The model was trained on specific and non-specific data. Specific data consists of words that are closely related to the preceding statement. As the Google post explained:
“For example, if A says, ‘I love tennis,’ and B responds, ‘That’s nice,’ then the utterance should be marked, ‘not specific’. That reply could be used in dozens of different contexts. But if B responds, ‘Me too, I can’t get enough of Roger Federer!’, then it is marked as ‘specific’ since it relates closely to what is being discussed.”
The data that was used to train the model consisted of seven “turns” in the conversation. During training, the model had 2.6 billion parameters which examined 341 GB of text data for patterns, a dataset around 8.5 times larger than the dataset used to train the GPT-2 model created by OpenAI.
Google reported how Meena performed on the Sensibleness and Specificity Average (SSA) metric. The SSA is a metric designed by Google researchers, and it’s intended to quantify the ability of a conversational entity to reply with specific, relevant responses as a conversation goes on.
SSA scores are calculated by testing a model against a fixed number of prompts, and the number of sensible responses that the model gives is tracked. The model’s score is derived based on the percentage of sensible/specific responses the model was able to give with respect to the prompts. Generic responses are penalized. According to Google, an average person scores about 86% on the SSA, while Meena was able to score a 79%. Another famous AI model, an agent created by Pandora Bots, won the Loebner Prize in recognition of the fact that their AI bots achieved sophisticated human-like communication. The Pandora Bots agent achieved approximately 56% in the SSA test.
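To make the scoring concrete, here is a minimal Python sketch of how an SSA-style score could be computed from human labels. It assumes that each response is labeled for both sensibleness and specificity, that a response only counts as specific if it is also sensible, and that the final score is the plain average of the two rates. The `Rating` class and `ssa` function are illustrative names, not Google’s code.

```python
from dataclasses import dataclass


@dataclass
class Rating:
    """Human labels for one model response (hypothetical structure)."""
    sensible: bool
    specific: bool


def ssa(ratings):
    """Average of the sensibleness rate and the specificity rate."""
    if not ratings:
        raise ValueError("at least one rated response is required")
    n = len(ratings)
    sensibleness = sum(r.sensible for r in ratings) / n
    # Assumption: a response that is not sensible cannot count as specific.
    specificity = sum(r.sensible and r.specific for r in ratings) / n
    return (sensibleness + specificity) / 2


# Four rated responses: 3 of 4 are sensible, 2 of 4 are also specific.
example = [
    Rating(sensible=True, specific=True),
    Rating(sensible=True, specific=False),   # e.g. a generic "That's nice"
    Rating(sensible=False, specific=False),
    Rating(sensible=True, specific=True),
]
print(f"SSA: {ssa(example):.3f}")
```

A generic but sensible reply raises only the sensibleness rate, which is how generic responses end up penalized in the combined score.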
Microsoft and Amazon are also trying to make more flexible and natural chatbots. Microsoft has been attempting to create multiturn dialogue in chatbots for two years, acquiring Semantic Machines, an AI startup, to improve Cortana. Amazon recently ran the Alexa Prize challenge, which prompted participants to design a bot capable of conversing for approximately 20 minutes.
Brain Implants and AI Model Used To Translate Thought Into Text
Researchers at the University of California, San Francisco have recently created an AI system that can produce text by analyzing a person’s brain activity, essentially translating their thoughts into text. The AI takes neural signals from a user and decodes them, and it can decipher up to 250 words in real-time based on a set of between 30 and 50 sentences.
As reported by the Independent, the AI model was trained on neural signals collected from four women. The participants in the experiment had electrodes implanted in their brains to monitor for the occurrence of epileptic seizures. The participants were instructed to read sentences aloud, and their neural signals were fed to the AI model. The model was able to discern neural activity correlated with specific words, and the patterns aligned with the actual words approximately 97% of the time, with an average error rate of around 3%.
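For readers unfamiliar with how such an error rate is usually measured, the sketch below computes a word error rate (WER) in Python. It assumes the reported figure is a WER in the standard speech recognition sense: the word-level edit distance between the decoded sentence and the sentence actually spoken, divided by the length of the spoken sentence. The function name and sample sentences are illustrative and do not come from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


spoken = "the cat sat on the mat"
decoded = "the cat sat on a mat"
print(f"WER: {word_error_rate(spoken, decoded):.2%}")
```

Under this definition, a 3% error rate means roughly three word-level mistakes for every hundred words spoken.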
This isn’t the first time that neural signals have been correlated with sentences; neuroscientists have been working on similar projects for over a decade. However, the AI model created by the researchers shows impressive accuracy and operates in more or less real-time. The model utilizes a recurrent neural network to encode the neural activity into representations that can be translated into words. As the authors say in their paper:
“Taking a cue from recent advances in machine translation, we trained a recurrent neural network to encode each sentence-length sequence of neural activity into an abstract representation, and then to decode this representation, word by word, into an English sentence.”
According to ArsTechnica, in order to better understand how links were made between neural signals and words, the researchers experimented by disabling different parts of the system. The systematic disabling made it clear that the system’s accuracy was due to the neural representation. It was also found that disabling the audio inputs to the system made errors jump, but the overall performance was still considered reliable. Obviously, this means the system could potentially be useful as a device for those who cannot speak.
When different portions of the electrode input were disabled, it was found that the system was paying the most attention to certain key brain regions associated with speech processing and production. For instance, a decent portion of the system’s performance was based on brain regions that pay attention to the sound of one’s own voice when speaking.
While the initial results seem promising, the research team isn’t sure how well the model will scale to larger vocabularies. It’s important that the principle can be generalized to larger vocabularies, as the average English speaker has an active vocabulary of approximately 20,000 words. The current decoder method operates by interpreting the static structure of a sentence and using that structure to make educated guesses about the words that match a particular neural activity pattern. As the vocabulary grows, overall accuracy could be reduced as more neural patterns may tend to look similar.
The authors of the paper explain that while they hope the decoder will eventually learn how to discern regular, reliable patterns in language, they aren’t sure how much data is required to train a model capable of generalizing to the everyday English language. One potential way of dealing with this problem is supplementing the training with data gathered from other brain-computer interfaces making use of different algorithms and implants.
The work done by the University of California researchers is just one recent development in a growing wave of research and development regarding neural interfaces and computers. The Royal Society released a report last year that predicted neural interfaces linking people to computers will eventually let people read each other’s minds. The report references the Neuralink startup created by Elon Musk and technologies developed by Facebook as evidence of the coming advances in human-oriented computing. The Royal Society notes that human-computer interfaces will be a powerful option in treating neurodegenerative diseases such as Alzheimer’s over the next two decades.
Anastassia Loukina, Senior Research Scientist (NLP/Speech) at ETS – Interview Series
Her research interests span a wide range of topics. She has worked among other things on Modern Greek dialects, speech rhythm and automated prosody analysis.
Her current work focuses on combining tools and methods from speech technologies and machine learning with insights from studies on speech perception/production in order to build automated scoring models for evaluating non-native speech.
You clearly have a love of languages. What introduced you to this passion?
I grew up speaking Russian in St. Petersburg, Russia and I remember being fascinated when I was first introduced to the English language: for some words, there was a pattern that made it possible to “convert” a Russian word to an English word. And then I would come across a word where “my” pattern failed and try to come up with a better, more general rule. At that time of course, I knew nothing about linguistic typology or the difference between cognates and loan words, but this fueled my curiosity and desire to learn more languages. This passion for identifying patterns in how people speak and testing them on the data is what led me to phonetics, machine learning and the work I am doing now.
Prior to your current work in Natural Language Processing (NLP) you were a translator between English-Russian and Modern Greek-Russian. Do you believe that your work as a translator has given you additional insights into some of the nuances and problems associated with NLP?
My primary identity has always been that of a researcher. It’s true that I started my academic career as a scholar of Modern Greek, or more specifically, Modern Greek phonetics. For my doctoral work, I explored phonetic differences between several Modern Greek dialects and how the differences between these dialects could be linked to the history of the area. I argued that some of the differences between the dialects could have emerged as a result of the language contact between each dialect and other languages spoken in the area. While I no longer work on Modern Greek, the changes that happen when two languages come in contact with each other are still at the heart of my work: only this time I focus on what happens when an individual is learning a new language and how technology can help do this most efficiently.
When it comes to the English language, there are a myriad of accents. How do you design an NLP system with the capability to understand all of the different dialects? Is it a simple matter of feeding the deep learning algorithm additional big data from each type of accent?
There are several approaches that have been used in the past to address this. In addition to building one large model that covers all accents, you could first identify the accent and then use a custom model for this accent, or you can try multiple models at once and pick the one which works best. Ultimately, to achieve a good performance on a wide range of accents you need training and evaluation data representative of the many accents that a system may encounter.
At ETS we conduct comprehensive evaluations to make sure that the scores produced by our automated systems reflect differences in the actual skills we want to measure and are not influenced by the demographic characteristics of the learner such as their gender, race, or country of origin.
Children and/or language learners often have difficulty with perfect pronunciation. How do you overcome the pronunciation problem?
There is no such thing as perfect pronunciation: the way we speak is closely linked to our identity and as developers and researchers our goal is to make sure that our systems are fair to all users.
Both language learners and children present particular challenges for speech-based systems. For example, child voices not only have very different acoustic quality but children also speak differently from adults and there is a lot of variability between children. As a result, developing an automated speech recognition system for children is usually a separate task that requires a large amount of child speech data.
Similarly, even though there are many similarities between language learners from the same background, learners can vary widely in their use of phonetic, grammatical and lexical patterns making speech recognition a particularly challenging task. When building our systems for scoring English language proficiency, we use the data from language learners with a wide range of proficiencies and native languages.
In January 2018, you published ‘Using exemplar responses for training and evaluating automated speech scoring systems’. What are some of the main fundamentals that should be understood from this paper?
In this paper, we looked at how the quality of training and testing data affects the performance of automated scoring systems.
Automated scoring systems, like many other automated systems, are trained on data that has been labeled by humans. In this case, these are scores assigned by human raters. Human raters do not always agree in the scores they assign. There are several different strategies used in assessment to ensure that the final score reported to the test-taker remains highly reliable despite variation in human agreement at the level of the individual question. However, since automated scoring engines are usually trained using response-level scores, any inconsistencies in such scores may negatively affect the system.
We had access to a large amount of data with different levels of agreement between human raters, which let us compare system performance under different conditions. What we found is that training the system on perfect data doesn’t actually improve its performance over a system trained on data with noisier labels. Perfect labels only give you an advantage when the total size of your training set is very small. On the other hand, the quality of human labels had a huge effect on system evaluation: your performance estimates can be up to 30% higher if you evaluate on clean labels.
The takeaway message is that if you have a lot of data and resources to clean your gold-standard labels, it might be smarter to clean the labels in the evaluation set rather than the labels in the training set. And this finding applies not just to automated scoring but to many other areas too.
Could you describe some of your work at ETS?
I work on speech scoring systems that process spoken language in an educational context. One such system is SpeechRater®, which uses advanced speech recognition and analysis technology to assess and provide detailed feedback about English language speaking proficiency. SpeechRater is a very mature application that has been around for more than 10 years. I build scoring models for different applications and work with other colleagues across ETS to ensure that our scores are reliable, fair and valid for all test takers. We also work with other groups at ETS to continuously monitor system performance.
In addition to maintaining and improving our operational systems, we prototype new systems. One of the projects I am very excited about is RelayReader™: an application designed to help developing readers gain fluency and confidence. When reading with RelayReader, a user takes turns listening to and reading aloud a book. Their reading is then sent to our servers to provide feedback. In terms of speech processing, the main challenge of this application is how to measure learning and provide actionable and reliable feedback unobtrusively, without interfering with the reader’s engagement with the book.
What’s your favorite part of working with ETS?
What initially attracted me to ETS is that it is a non-profit organization with a mission to advance the quality of education for all people around the world. While of course it is great when research leads to a product, I appreciate having an opportunity to work on projects that are more foundational in nature but will help with product development in the future. I also cherish the fact that ETS takes issues such as data privacy and fairness very seriously and all our systems undergo very stringent assessment before being deployed operationally.
But what truly makes ETS a great place to work is its people. We have an amazing community of scientists, engineers and developers from many different backgrounds which allows for a lot of interesting collaborations.
Do you believe that an AI will ever be able to pass the Turing Test?
Since the 1950s, there have been many interpretations of how the Turing test should be done in practice. There is probably general agreement that the Turing test hasn’t been passed in the philosophical sense that there is no AI system that thinks like a human. However, this has also become a very niche subject. Most people don’t build their systems to pass the Turing test – we want them to achieve specific goals.
For some of these tasks, for example, speech recognition or natural language understanding, human performance may be rightly considered the gold standard. But there are also many other tasks where we would expect an automated system to do much better than humans or where an automated system and human expert need to work together to achieve the best result. For example, in an educational context we don’t want an AI system to replace a teacher: we want it to help teachers, whether it is through identifying patterns in student learning trajectories, helping with grading or finding the best teaching materials.
Is there anything else that you would like to share about ETS or NLP?
Many people know ETS for its assessments and automated scoring systems. But we do much more than that. We have many capabilities from voice biometrics to spoken dialogue applications and we are always looking for new ways to integrate technology into learning. Now that many students are learning from home, we have opened several of our research capabilities to the general public.
Thank you for the interview and for offering this insight on the latest advances in NLP and speech recognition. Anyone who wishes to learn more can visit Educational Testing Service.
Alexander Hudek, Co-Founder & CTO of Kira Systems – Interview Series
Alex Hudek is the Co-Founder & CTO of Kira Systems. He holds Ph.D and M.Math degrees in Computer Science from the University of Waterloo, and a B.Sc. from the University of Toronto in Physics and Computer Science.
His past research in the field of bioinformatics focused on finding similarities between DNA sequences. He has also worked in the areas of proof systems and database query compilation.
When did you initially become interested in machine learning and AI?
I’ve always been interested in computer science. In undergrad I took courses in algorithms for planning and logic, machine learning and AI, numerical computing, and other topics. My interest in machine learning grew more specifically during my PhD at the University of Waterloo. There, I used machine learning methods to study DNA. Afterwards, I dove more deeply into formal logics as part of my postdoctoral research. Logic and reasoning is in some ways the “other side” of the coin in approaches to AI and I felt it important to know more about it.
Some of your past research in the field of bioinformatics focused on finding similarities between DNA sequences. Could you discuss some of this work?
The main body of my thesis involved building a more realistic model of DNA mutation using Hidden Markov Models. I used this more complex model in a new algorithm designed to find regions of DNA that share common ancestry with other species. In particular, this new algorithm can find much more weakly related sequence regions than previous algorithms for the task.
Before my PhD, I worked in a research lab that was part of the human genome project. One of the most notable projects I helped complete was the first complete draft of human chromosome 7.
What was the initial inspiration behind launching Kira?
The idea for Kira came from my co-founder, Noah Waisberg. He had spent hours in his career as a lawyer doing the sort of work we’ve now built AI to do. It was an interesting idea to me because it involved natural language and the problem was well scoped, and I could see the business potential. There is something alluring about building AI that can understand human language because language is so closely related to human cognition.
Can you describe what Contract Analysis Software is and how it benefits legal professionals?
Kira uses supervised machine learning, meaning an experienced lawyer feeds provisions from real contracts into a system designed to learn from those examples. The system studies this data, learns what language is relevant, and builds probabilistic provision models. The models are then tested against a set of annotated agreements that the system is unfamiliar with in order to determine its readiness. This highly accurate machine learning technology can identify and analyze virtually any provision in any contract, resulting in customer-reported time savings of 20-90%. This increased productivity helps law firms increase their realization rates and gives them more opportunity to grow their revenue and preserve their existing clients. For corporations, it drives better in-house productivity, reducing the amount of external legal spend required.
Natural Language Processing (NLP) is difficult for most companies, could you discuss some of the additional challenges that are faced when it comes to processing legal terminology and other nuances that are unique to the legal profession?
For many people legal language can seem very foreign, but it turns out that from a machine learning perspective it’s not actually that different. There are a few unique things: capitalization is more important and sentences can be much longer than normal, but overall we haven’t needed significantly different NLP approaches than in other domains.
One aspect that is significantly different is the need for data privacy and customization. Legal professionals are required to keep client data confidential, and using it in a machine learning product that pools or shares training data is at odds with those requirements. In fact, even keeping training data is often not possible as they have obligations to delete client data after a project concludes. Thus, being able to train models without vendors in the loop becomes critical, as do machine learning techniques that make it hard or impossible to recover any part of the training data by inspecting learned models. Techniques that allow you to take an existing model and update it with new training data without retraining from scratch are also a must have.
On the customization front, there is a need for clients to be able to build their own models. This is because for more complex legal concepts there can be reasonable disagreement among professionals, and firms often want to tune or build models to match their own unique positions.
Could you describe how deep learning is used to categorize data within Kira software?
We don’t use much deep learning in our product, though our internal research team does spend a lot of time evaluating and exploring deep learning solutions. So far, on the sorts of problems we face, deep learning techniques are only matching non-deep learning approaches, or at best getting a very small increase. Given the huge computation overhead of deep learning methodologies, as well as challenges in keeping training data private, they haven’t been compelling enough to adopt so far.
That said, we do find deep learning approaches to be very compelling and we think they have a potential to become big in NLP one day. To that end, we continually evaluate and explore deep learning NLP approaches so that we can be ready to adopt when the advantages start outweighing the disadvantages.
What are some of the built-in provision models that Kira offers?
Currently Kira can identify and extract over 1,000 built-in provisions, clauses, and data points (smart fields). They relate to a multitude of different topics, from M&A due diligence, which Kira was originally conceived to assist with, to Brexit and real estate. The smart fields are built by our team of subject matter experts, which includes experienced lawyers and accountants. With our machine learning technology, Kira’s standards require virtually every smart field to achieve a minimum of 90% recall, meaning our software will find 90% or more of the provisions, clauses or data points you’re specifically looking for within your contracts or documents, reducing risks and errors in the contract review process. In addition, an unlimited number of custom fields can be created/taught by a firm to automatically identify and extract relevant insights using our Quick Study tool.
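As a rough illustration of what a 90% recall threshold means, the Python sketch below computes recall for a single field on a held-out set of annotated documents. The `recall` function, the (document, clause) identifiers and the sample data are all hypothetical; this is not Kira’s evaluation code, only the standard definition of recall applied to a toy example.

```python
def recall(expected: set, found: set) -> float:
    """Fraction of the annotated items that the model actually found."""
    if not expected:
        raise ValueError("expected must contain at least one annotated item")
    return len(expected & found) / len(expected)


# Hypothetical ground truth: ten annotated clauses, identified by
# (document id, clause id) pairs.
annotated = {("doc1", i) for i in range(10)}

# Hypothetical model output: nine of the ten annotated clauses,
# plus two spurious extractions that recall does not penalize.
extracted = {("doc1", i) for i in range(9)} | {("doc2", 0), ("doc2", 1)}

score = recall(annotated, extracted)
print(f"recall: {score:.0%}")
print("meets 90% threshold:", score >= 0.9)
```

Recall only measures what was missed; the two spurious extractions would show up in a precision figure instead, which is why the two metrics are usually reported together.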
The legal world is often known for being slow to adopt new technology. Do you find that there’s an education hurdle when it comes to educating law firms?
Lawyers really like to know how things work, so education is important. It’s no harder to teach lawyers about machine learning and AI than other professionals, but it is definitely required to have training materials ready. Many of the adoption hurdles are social too; people often ask about best practices in adapting their internal processes to use AI, or are interested in how they can use AI to change their business offerings in a way that gives them advantages beyond just efficiency improvements.
Compared to when we started Kira Systems in 2011, law firms today are far more savvy about AI and technology. Many have innovation teams who are tasked with investigating new technology and encouraging adoption of new solutions.
Is there anything else that you would like to share about Kira?
Academic literature and open source machine learning libraries were instrumental in helping us bootstrap the company. We believe that open information and software is a huge boon to the world. In light of that, I’m especially happy that our research team publishes the results of many of our research efforts in academic journals and conferences. Aside from demonstrating that we push the boundaries of the state of the art, this allows us to give back to the communities that helped us get started, and that we continue to get a ton of value from. You can find our papers at https://kirasystems.com/science/.
To learn more visit Kira Systems.