Researchers from the University of Maryland have created a set of questions that are easy for people to answer but hard for some of the best computer question-answering systems that exist today. The team generated the questions through a human-computer collaboration, building a dataset of more than 1,200 questions. If a computer system is able to learn and master these questions, it will have the best understanding of human language of any computer system that currently exists.
The work was published in an article in the journal Transactions of the Association for Computational Linguistics.
Jordan Boyd-Graber, an associate professor of computer science at UMD and senior author of the paper, spoke about the new developments.
“Most question-answering computer systems don’t explain why they answer the way they do, but our work helps us see what computers actually understand,” he said. “In addition, we have produced a dataset to test on computers that will reveal if a computer language system is actually reading and doing the same sorts of processing that humans are able to do.”
As of right now, questions for these programs and systems are generated by either human authors or computers. The problem is that when humans generate questions, they aren’t aware of all the elements of a question that are confusing to computers. Computer systems, on the other hand, use formulas, write fill-in-the-blank questions, or make mistakes, all of which can generate nonsense.
To enable the cooperation between humans and computers that generated the questions, Boyd-Graber and the team of researchers created a special computer interface. According to them, it reveals what a computer is “thinking” while a human types out a question. The writer can then edit the question to exploit the computer’s weaknesses and confuse it.
As the writer types the question, the computer’s guesses are put in a ranked order. The words that are responsible for the computer’s guesses are highlighted.
When the system correctly answers a question, the interface highlights the words or phrases that led to the answer. With that information, the author can edit the question to make it more difficult for the computer while preserving its meaning. Although the computer eventually becomes confused, expert humans are still able to answer.
Working together, the humans and computers developed 1,213 questions that the computer was not able to answer. The researchers tested the questions in a competition between human players and the computers. The human players included high school trivia teams and “Jeopardy!” champions. The weakest human team was able to defeat the strongest computer system.
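The article does not describe how the interface computes its rankings or highlights. As a rough illustration only, one simple way to find the words responsible for a guess is to remove each word in turn and measure how much the score of the top guess drops. The sketch below assumes a made-up keyword-overlap scorer (`score_answers`) and a hypothetical two-entry knowledge base (`KNOWLEDGE`); neither comes from the paper.

```python
# Toy illustration only: the knowledge base and scoring rule are invented,
# not taken from the UMD interface.
KNOWLEDGE = {
    "Johannes Brahms": {"composer", "german", "requiem", "lullaby", "symphony"},
    "Claude Debussy": {"composer", "french", "clair", "lune", "prelude"},
}


def score_answers(words):
    """Score each candidate answer by how many question words match its keywords."""
    return {answer: sum(w in keys for w in words) for answer, keys in KNOWLEDGE.items()}


def ranked_guesses(question):
    """Return candidate answers sorted from highest to lowest score."""
    scores = score_answers(question.lower().split())
    return sorted(scores, key=scores.get, reverse=True)


def word_importance(question):
    """Leave-one-out importance: score drop for the top guess when a word is removed."""
    words = question.lower().split()
    top = ranked_guesses(question)[0]
    base = score_answers(words)[top]
    importance = {}
    for i, word in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        importance[word] = base - score_answers(reduced)[top]
    return top, importance


question = "German composer of a famous lullaby and a requiem"
top, importance = word_importance(question)
# Words whose removal lowers the top guess's score are the ones to highlight.
highlighted = [w for w, drop in importance.items() if drop > 0]
print(top, highlighted)
```

A writer looking at these highlights might paraphrase a cue word (for example, replacing "lullaby" with "cradle song") so that the question keeps its meaning but loses the keyword the toy scorer relied on.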
Shi Feng, a computer science graduate student from UMD and co-author of the paper, spoke about the new research.
“For three or four years, people have been aware that computer question-answering systems are very brittle and can be fooled very easily,” she said. “But this is the first paper we are aware of that actually uses a machine to help humans break the model itself.”
The questions revealed six different language phenomena that confuse computers, which fall into two categories. The first is linguistic phenomena, which includes paraphrasing, distracting language, and unexpected contexts. The second is reasoning skills, which includes logic and calculation, mental triangulation of elements in a question, and putting together multiple steps to form a conclusion.
“Humans are able to generalize more and to see deeper connections,” Boyd-Graber said. “They don’t have the limitless memory of computers, but they still have an advantage in being able to see the forest for the trees. Cataloguing the problems computers have helps us understand the issues we need to address, so that we can actually get computers to begin to see the forest through the trees and answer questions in the way humans do.”
This research lays the foundation for computer systems to eventually master human language, and the dataset gives researchers a concrete benchmark for measuring progress.
“This paper is laying out a research agenda for the next several years so that we can actually get computers to answer questions well,” Boyd-Graber said.
Anastassia Loukina, Senior Research Scientist (NLP/Speech) at ETS – Interview Series
Her research interests span a wide range of topics. She has worked, among other things, on Modern Greek dialects, speech rhythm and automated prosody analysis.
Her current work focuses on combining tools and methods from speech technologies and machine learning with insights from studies on speech perception/production in order to build automated scoring models for evaluating non-native speech.
You clearly have a love of languages. What introduced you to this passion?
I grew up speaking Russian in St. Petersburg, Russia and I remember being fascinated when I was first introduced to the English language: for some words, there was a pattern that made it possible to “convert” a Russian word to an English word. And then I would come across a word where “my” pattern failed and try to come up with a better, more general rule. At that time of course, I knew nothing about linguistic typology or the difference between cognates and loan words, but this fueled my curiosity and desire to learn more languages. This passion for identifying patterns in how people speak and testing them on the data is what led me to phonetics, machine learning and the work I am doing now.
Prior to your current work in Natural Language Processing (NLP) you were a translator between English-Russian and Modern Greek-Russian. Do you believe that your work as a translator has given you additional insights into some of the nuances and problems associated with NLP?
My primary identity has always been that of a researcher. It’s true that I started my academic career as a scholar of Modern Greek, or more specifically, Modern Greek phonetics. For my doctoral work, I explored phonetic differences between several Modern Greek dialects and how the differences between these dialects could be linked to the history of the area. I argued that some of the differences between the dialects could have emerged as a result of the language contact between each dialect and other languages spoken in the area. While I no longer work on Modern Greek, the changes that happen when two languages come in contact with each other are still at the heart of my work: only this time I focus on what happens when an individual is learning a new language and how technology can help do this most efficiently.
When it comes to the English language, there are a myriad of accents. How do you design an NLP system with the capability to understand all of the different dialects? Is it a simple matter of feeding the deep learning algorithm additional big data from each type of accent?
There are several approaches that have been used in the past to address this. In addition to building one large model that covers all accents, you could first identify the accent and then use a custom model for this accent, or you can try multiple models at once and pick the one which works best. Ultimately, to achieve a good performance on a wide range of accents you need training and evaluation data representative of the many accents that a system may encounter.
At ETS we conduct comprehensive evaluations to make sure that the scores produced by our automated systems reflect differences in the actual skills we want to measure and are not influenced by the demographic characteristics of the learner such as their gender, race, or country of origin.
Children and/or language learners often have difficulty with perfect pronunciation. How do you overcome the pronunciation problem?
There is no such thing as perfect pronunciation: the way we speak is closely linked to our identity and as developers and researchers our goal is to make sure that our systems are fair to all users.
Both language learners and children present particular challenges for speech-based systems. For example, child voices not only have very different acoustic quality but children also speak differently from adults and there is a lot of variability between children. As a result, developing an automated speech recognition system for children is usually a separate task that requires a large amount of child speech data.
Similarly, even though there are many similarities between language learners from the same background, learners can vary widely in their use of phonetic, grammatical and lexical patterns making speech recognition a particularly challenging task. When building our systems for scoring English language proficiency, we use the data from language learners with a wide range of proficiencies and native languages.
In January 2018, you published ‘Using exemplar responses for training and evaluating automated speech scoring systems’. What are some of the main breakthroughs and fundamentals that should be understood from this paper?
In this paper, we looked at how quality of training and testing data affects the performance of automated scoring systems.
Automated scoring systems, like many other automated systems, are trained on data that has been labeled by humans. In this case, these are scores assigned by human raters. Human raters do not always agree in the scores they assign. There are several different strategies used in assessment to ensure that the final score reported to the test-taker remains highly reliable despite variation in human agreement at the level of the individual question. However, since automated scoring engines are usually trained using response-level scores, any inconsistencies in such scores may negatively affect the system.
We had access to a large amount of data with different levels of agreement between human raters, which let us compare system performance under different conditions. What we found is that training the system on perfect data doesn’t actually improve its performance over a system trained on data with noisier labels. Perfect labels only give you an advantage when the total size of your training set is very small. On the other hand, the quality of human labels had a huge effect on system evaluation: your performance estimates can be up to 30% higher if you evaluate on clean labels.
The takeaway message is that if you have a lot of data and resources to clean your gold-standard labels, it might be smarter to clean the labels in the evaluation set rather than the labels in the training set. And this finding applies not just to automated scoring but to many other areas too.
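The effect of label quality on evaluation can be sketched with a small simulation. This is an illustration under invented assumptions, not a reproduction of the paper's experiment: the score range, the noise levels, and the use of Pearson correlation as the performance measure are all choices made here for the example. The same simulated system is compared once against near-clean labels and once against noisy labels.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Assumed setup: an unobserved "true" proficiency per response.
true_scores = [rng.uniform(1, 4) for _ in range(n)]
# One fixed system whose predictions carry some error of their own.
system_scores = [t + rng.gauss(0, 0.5) for t in true_scores]
# Two versions of the human labels for the SAME evaluation set.
clean_labels = [t + rng.gauss(0, 0.1) for t in true_scores]
noisy_labels = [t + rng.gauss(0, 0.8) for t in true_scores]

r_clean = pearson(system_scores, clean_labels)
r_noisy = pearson(system_scores, noisy_labels)
print(f"agreement with clean labels: {r_clean:.2f}")
print(f"agreement with noisy labels: {r_noisy:.2f}")
```

The system is identical in both comparisons, so any gap between the two numbers comes entirely from noise in the evaluation labels, which is why cleaning the evaluation set can change the performance estimate so much.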
Could you describe some of your work at ETS?
I work on speech scoring systems that process spoken language in an educational context. One such system is SpeechRater®, which uses advanced speech recognition and analysis technology to assess and provide detailed feedback about English language speaking proficiency. SpeechRater is a very mature application that has been around for more than 10 years. I build scoring models for different applications and work with other colleagues across ETS to ensure that our scores are reliable, fair and valid for all test takers. We also work with other groups at ETS to continuously monitor system performance.
In addition to maintaining and improving our operational systems, we prototype new systems. One of the projects I am very excited about is RelayReader™: an application designed to help developing readers gain fluency and confidence. When reading with RelayReader, a user takes turns listening to and reading aloud a book. Their reading is then sent to our servers to provide feedback. In terms of speech processing, the main challenge of this application is how to measure learning and provide actionable and reliable feedback unobtrusively, without interfering with the reader’s engagement with the book.
What’s your favorite part of working with ETS?
What initially attracted me to ETS is that it is a non-profit organization with a mission to advance the quality of education for all people around the world. While of course it is great when research leads to a product, I appreciate having an opportunity to work on projects that are more foundational in nature but will help with product development in the future. I also cherish the fact that ETS takes issues such as data privacy and fairness very seriously and all our systems undergo very stringent assessment before being deployed operationally.
But what truly makes ETS a great place to work is its people. We have an amazing community of scientists, engineers and developers from many different backgrounds which allows for a lot of interesting collaborations.
Do you believe that an AI will ever be able to pass the Turing Test?
Since the 1950s, there have been many interpretations of how the Turing test should be done in practice. There is probably a general agreement that the Turing test hasn’t been passed in the philosophical sense that there is no AI system that thinks like a human. However, this has also become a very niche subject. Most people don’t build their systems to pass the Turing test: we want them to achieve specific goals.
For some of these tasks, for example, speech recognition or natural language understanding, human performance may be rightly considered the gold standard. But there are also many other tasks where we would expect an automated system to do much better than humans or where an automated system and human expert need to work together to achieve the best result. For example, in an educational context we don’t want an AI system to replace a teacher: we want it to help teachers, whether it is through identifying patterns in student learning trajectories, help with grading or finding the best teaching materials.
Is there anything else that you would like to share about ETS or NLP?
Many people know ETS for its assessments and automated scoring systems. But we do much more than that. We have many capabilities from voice biometrics to spoken dialogue applications and we are always looking for new ways to integrate technology into learning. Now that many students are learning from home, we have opened several of our research capabilities to the general public.
Thank you for the interview and for offering this insight on the latest advances in NLP and speech recognition. Anyone who wishes to learn more can visit Educational Testing Service.
Alexander Hudek, Co-Founder & CTO of Kira Systems – Interview Series
Alex Hudek is the Co-Founder & CTO of Kira Systems. He holds Ph.D and M.Math degrees in Computer Science from the University of Waterloo, and a B.Sc. from the University of Toronto in Physics and Computer Science.
His past research in the field of bioinformatics focused on finding similarities between DNA sequences. He has also worked in the areas of proof systems and database query compilation.
When did you initially become interested in machine learning and AI?
I’ve always been interested in computer science. In undergrad I took courses in algorithms for planning and logic, machine learning and AI, numerical computing, and other topics. My interest in machine learning grew more specifically during my PhD at the University of Waterloo. There, I used machine learning methods to study DNA. Afterwards, I dove more deeply into formal logics as part of my postdoctoral research. Logic and reasoning is in some ways the “other side” of the coin in approaches to AI and I felt it important to know more about it.
Some of your past research in the field of bioinformatics focused on finding similarities between DNA sequences. Could you discuss some of this work?
The main body of my thesis involved building a more realistic model of DNA mutation using Hidden Markov Models. I used this more complex model in a new algorithm designed to find regions of DNA that share common ancestry with other species. In particular, this new algorithm can find much more weakly related sequence regions than previous algorithms for the task.
Before my PhD, I worked in a research lab that was part of the human genome project. One of the most notable projects I helped complete was the first complete draft of human chromosome 7.
What was the initial inspiration behind launching Kira?
The idea for Kira came from my co-founder, Noah Waisberg. He had spent hours in his career as a lawyer doing the sort of work we’ve now built AI to do. It was an interesting idea to me because it involved natural language and the problem was well scoped, and I could see the business potential. There is something alluring about building AI that can understand human language because language is so closely related to human cognition.
Can you describe what Contract Analysis Software is and how it benefits legal professionals?
Kira uses supervised machine learning, meaning an experienced lawyer feeds provisions from real contracts into a system designed to learn from those examples. The system studies this data, learns what language is relevant, and builds probabilistic provision models. The models are then tested against a set of annotated agreements that the system is unfamiliar with in order to determine its readiness. This highly accurate machine learning technology can identify and analyze virtually any provision in any contract, resulting in customer-reported time savings of 20-90%. This increased productivity helps law firms by increasing their realization rates and gives them more opportunity to grow their revenue and retain their existing clients. For corporations, it drives better in-house productivity, reducing the amount of external legal spend required.
Natural Language Processing (NLP) is difficult for most companies. Could you discuss some of the additional challenges that are faced when it comes to processing legal terminology and other nuances that are unique to the legal profession?
For many people legal language can seem very foreign, but it turns out that from a machine learning perspective it’s not actually that different. There are a few unique things: capitalization is more important and sentences can be much longer than normal. Overall, though, we haven’t needed significantly different NLP approaches than in other domains.
One aspect that is significantly different is the need for data privacy and customization. Legal professionals are required to keep client data confidential, and using it in a machine learning product that pools or shares training data is at odds with those requirements. In fact, even keeping training data is often not possible as they have obligations to delete client data after a project concludes. Thus, being able to train models without vendors in the loop becomes critical, as do machine learning techniques that make it hard or impossible to recover any part of the training data by inspecting learned models. Techniques that allow you to take an existing model and update it with new training data without retraining from scratch are also a must have.
On the customization front, there is a need for clients to be able to build their own models. This is because for more complex legal concepts there can be reasonable disagreement among professionals, and firms often want to tune or build models to match their own unique positions.
Could you describe how deep learning is used to categorize data within Kira software?
We don’t use much deep learning in our product, though our internal research team does spend a lot of time evaluating and exploring deep learning solutions. So far, on the sorts of problems we face, deep learning techniques are only matching non-deep learning approaches, or at best getting a very small increase. Given the huge computation overhead of deep learning methodologies, as well as challenges in keeping training data private, they haven’t been compelling enough to adopt so far.
That said, we do find deep learning approaches to be very compelling and we think they have a potential to become big in NLP one day. To that end, we continually evaluate and explore deep learning NLP approaches so that we can be ready to adopt when the advantages start outweighing the disadvantages.
What are some of the built-in provision models that Kira offers?
Currently Kira can identify and extract over 1,000 built-in provisions, clauses, and data points (smart fields). They relate to a multitude of different topics, from M&A due diligence (which Kira was originally conceived to assist with) to Brexit to real estate. The smart fields are built by our team of subject matter experts, which includes experienced lawyers and accountants. With our machine learning technology, Kira’s standards require virtually every smart field to achieve a minimum of 90% recall, meaning our software will find 90% or more of the provisions, clauses or data points you’re specifically looking for within your contracts or documents, reducing risks and errors in the contract review process. In addition, an unlimited number of custom fields can be created and taught by a firm to automatically identify and extract relevant insights using our Quick Study tool.
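For readers unfamiliar with the metric, recall is the share of the items that should have been found that the system actually found. The sketch below is a generic illustration of that definition and of a 90% threshold check; the clause identifiers and the `recall` helper are hypothetical and do not represent Kira's software or data.

```python
def recall(expected, found):
    """Fraction of the expected items that also appear in the found items."""
    expected, found = set(expected), set(found)
    if not expected:
        raise ValueError("recall is undefined when there is nothing to find")
    return len(expected & found) / len(expected)


# Hypothetical example: ten clauses marked by a human reviewer...
annotated = {f"c{i:02d}" for i in range(1, 11)}        # c01 .. c10
# ...and what a model flagged: nine of them, plus one false alarm.
flagged = {f"c{i:02d}" for i in range(1, 10)} | {"x77"}

r = recall(annotated, flagged)
meets_threshold = r >= 0.90
print(f"recall = {r:.0%}, meets 90% minimum: {meets_threshold}")
```

Note that the false alarm (`x77`) does not lower recall; it would only show up in a precision measurement. A recall-focused standard prioritizes not missing relevant provisions.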
The legal world is often known for being slow to adopt new technology. Do you find that there’s an education hurdle when it comes to educating law firms?
Lawyers really like to know how things work, so education is important. It’s no harder to teach lawyers about machine learning and AI than other professionals, but it is definitely required to have training materials ready. Many of the adoption hurdles are social too; people often ask about best practices in adapting their internal processes to use AI, or are interested in how they can use AI to change their business offerings in a way that gives them advantages beyond just efficiency improvements.
Compared to when we started Kira Systems in 2011, law firms today are far more savvy about AI and technology. Many have innovation teams who are tasked with investigating new technology and encouraging adoption of new solutions.
Is there anything else that you would like to share about Kira?
Academic literature and open source machine learning libraries were instrumental in helping us bootstrap the company. We believe that open information and software is a huge boon to the world. In light of that, I’m especially happy that our research team publishes the results of many of our research efforts in academic journals and conferences. Aside from demonstrating that we push the boundaries of the state of the art, this allows us to give back to the communities that helped us get started, and that we continue to get a ton of value from. You can find our papers at https://kirasystems.com/science/.
To learn more visit Kira Systems.
DeepScribe AI Can Help Translate Ancient Tablets
Researchers from the University of Chicago’s Oriental Institute (OI) and the Department of Computer Science have collaborated to design an AI that can help decode tablets from ancient civilizations. According to Phys.org, the AI is called DeepScribe and was trained on over 6,000 annotated images pulled from the Persepolis Fortification Archive. When it is complete, the AI model will be able to interpret unanalyzed tablets, making studying ancient documents easier.
Experts who study ancient documents, like the researchers who are studying the documents created during the Achaemenid Empire in Persia, need to translate ancient documents by hand, a long process that is prone to errors. Researchers have been using computers to assist in interpreting ancient documents since the 1990s, but the computer programs that were used were of limited help. The complex cuneiform characters, as well as the three-dimensional shape of the tablets, put a cap on how useful the computer programs could be.
Computer vision algorithms and deep learning architectures have brought new possibilities to this field. Sanjay Krishnan, from the Department of Computer Science, collaborated with Susanne Paulus, associate professor of Assyriology at OI, to launch the DeepScribe program. The researchers oversaw a database management platform called OCHRE, which organized data from archaeological excavations. The goal is to create an AI tool that is both extensive and flexible, able to interpret scripts from different geographical regions and time periods.
As Phys.org reported, Krishnan explained that the challenges of recognizing script, which archaeological researchers face, are essentially the same challenges faced by computer vision researchers:
“From the computer vision perspective, it’s really interesting because these are the same challenges that we face. Computer vision over the last five years has improved so significantly; ten years ago, this would have been hand wavy, we wouldn’t have gotten this far. It’s a good machine learning problem, because the accuracy is objective here, we have a labeled training set and we understand the script pretty well and that helps us. It’s not a completely unknown problem.”
The training set in question is the result of taking the tablets and translations from approximately 80 years of archaeological research done at OI and the University of Chicago and making high-resolution annotated images from them. Currently, the training data is approximately 60 terabytes in size. Researchers were able to use the dataset to create a dictionary of over 100,000 individually identified signs that the model could learn from. When the trained model was tested on an unseen image set, the model achieved approximately 80% accuracy.
While the team of researchers is attempting to increase the accuracy of the model, even 80% accuracy can assist in the process of transcription. According to Paulus, the model could be used to identify or translate highly repetitive parts of the documents, letting experts spend their time interpreting the more difficult parts of the document. Even if the model can’t say with certainty what a symbol translates to, it can give researchers probabilities, which already puts them ahead.
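The article does not say how DeepScribe produces its probabilities. As a hedged sketch, many image classifiers convert raw per-class scores into probabilities with a softmax and then show the top few candidates, which is one way a tool could offer ranked suggestions when it is not certain. The sign labels and score values below are placeholders, not real cuneiform readings or model outputs.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def top_candidates(scores, k=3):
    """Return the k most probable labels with their probabilities."""
    probs = softmax(scores)
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hypothetical raw classifier scores for a single sign image.
raw_scores = {"sign_A": 2.1, "sign_B": 1.9, "sign_C": 0.3, "sign_D": -1.0}
candidates = top_candidates(raw_scores)
for label, p in candidates:
    print(f"{label}: {p:.2f}")
```

In this toy case the two leading candidates have similar probabilities, so an expert would be shown a short list to choose from instead of a single uncertain answer.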
The team is also aiming to make DeepScribe a tool that other archeologists can use in their projects. For instance, the model could be retrained on other cuneiform languages, or the model could make informed estimates about the text on damaged or incomplete tablets. A sufficiently robust model could potentially even estimate the age and origin of tablets or other artifacts, something typically done with chemical testing.
The DeepScribe project is funded by the Center for Data and Computing (CDAC). Computer vision has been used in other CDAC-funded projects as well, like a project intended to recognize style in works of art and a project designed to quantify biodiversity in marine bivalves. The team of researchers is also hoping their collaboration will lead to future collaborations between the Department of Computer Science and OI at the University of Chicago.