Researchers from the University of Chicago’s Oriental Institute and the Department of Computer Science have collaborated to design an AI that can help decode tablets from ancient civilizations. According to Phys.org, the AI is called DeepScribe and was trained on over 6,000 annotated images pulled from the Persepolis Fortification Archive. When it is complete, the AI model will be able to interpret unanalyzed tablets, making it easier to study ancient documents.
Experts who study ancient documents, like the researchers who are studying the documents created during the Achaemenid Empire in Persia, need to translate ancient documents by hand, a long process that is prone to errors. Researchers have been using computers to assist in interpreting ancient documents since the 1990s, but the computer programs that were used were of limited help. The complex cuneiform characters, as well as the three-dimensional shape of the tablets, put a cap on how useful the computer programs could be.
Computer vision algorithms and deep learning architectures have brought new possibilities to this field. Sanjay Krishnan, from the Department of Computer Science, collaborated with associate professor of Assyriology Susanne Paulus to launch the DeepScribe program. The researchers oversaw a database management platform called OCHRE, which organized data from archaeological excavations. The goal is to create an AI tool that is both extensive and flexible, able to interpret scripts from different geographical regions and time periods.
As Phys.org reported, Krishnan explained that the challenges of recognizing script, which archaeological researchers face, are essentially the same challenges faced by computer vision researchers:
“From the computer vision perspective, it’s really interesting because these are the same challenges that we face. Computer vision over the last five years has improved so significantly; ten years ago, this would have been hand wavy, we wouldn’t have gotten this far. It’s a good machine learning problem, because the accuracy is objective here, we have a labeled training set and we understand the script pretty well and that helps us. It’s not a completely unknown problem.”
The training set in question is the result of taking the tablets and translations from approximately 80 years of archaeological research done at OI and U Chicago and making high-resolution annotated images from them. Currently, the training data is approximately 60 terabytes in size. Researchers were able to use the dataset to create a dictionary of over 100,000 individually identified signs that the model could learn from. When the trained model was tested on an unseen image set, it achieved approximately 80% accuracy.
While the team of researchers is attempting to increase the accuracy of the model, even 80% accuracy can assist in the process of transcription. According to Paulus, the model could be used to identify or translate highly repetitive parts of the documents, letting experts spend their time interpreting the more difficult parts of the document. Even if the model can’t say with certainty what a symbol translates to, it can give researchers probabilities, which already puts them ahead.
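To make the idea of "giving researchers probabilities" concrete, here is a minimal sketch of how raw classifier scores for one sign image could be turned into a ranked list of candidate readings. The labels, the scores, and the helper names (`softmax`, `top_candidates`) are illustrative assumptions and are not taken from DeepScribe itself.

```python
import math

def softmax(scores):
    """Convert raw model scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_candidates(sign_scores, k=3):
    """Return the k most probable readings as (label, probability) pairs."""
    labels = list(sign_scores)
    probs = softmax([sign_scores[label] for label in labels])
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical raw scores for a single sign image; labels are placeholders.
scores = {"sign_A": 2.1, "sign_B": 1.9, "sign_C": 0.3, "sign_D": -1.0}
for label, prob in top_candidates(scores):
    print(f"{label}: {prob:.2f}")
```

A ranked shortlist like this lets an expert confirm or reject a few likely readings rather than starting from a blank transcription.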
The team is also aiming to make DeepScribe a tool that other archeologists can use in their projects. For instance, the model could be retrained on other cuneiform languages, or the model could make informed estimates about the text on damaged or incomplete tablets. A sufficiently robust model could potentially even estimate the age and origin of tablets or other artifacts, something typically done with chemical testing.
The DeepScribe project is funded by the Center for Data and Computing (CDAC) at the University of Chicago. Computer vision has been used in other CDAC-funded projects as well, like a project intended to recognize style in works of art and a project designed to quantify biodiversity in marine bivalves. The team of researchers is also hoping their collaboration will lead to future collaborations between the Department of Computer Science and OI at the University of Chicago.
Paraphrase Generation Using Deep Reinforcement Learning – Thought Leaders
When writing or talking we’ve all wondered whether there is a better way of communicating an idea to others. What words should I use? How should I structure the thought? How are they likely to respond? At Phrasee, we spend a lot of time thinking about language – what works and what doesn’t.
Imagine you are writing the subject line for an email campaign that will go to 10 million people in your list promoting 20% off a fancy new laptop.
Which line would you pick:
- You can now take an extra 20% off your next order
- Get ready – an extra 20% off
While they convey the same information, one achieved an almost 15% higher open rate than the other (and I bet you can’t beat our model at predicting which one). While language can often be tested through A/B testing or multi-armed bandits, automatically generating paraphrases remains a really challenging research problem.
Two sentences are considered paraphrases of one another if they share the same meaning and can be used interchangeably. Another important thing which is often taken for granted is whether a machine generated sentence is fluent.
Unlike supervised learning, Reinforcement Learning (RL) agents learn through interacting with their environment and observing the rewards they receive as a result. This somewhat nuanced difference has massive implications for how the algorithms work and how the models are trained. Deep Reinforcement Learning uses neural networks as a function approximator to allow the agent to learn how to outperform humans in complex environments such as Go, Atari, and StarCraft II.
Despite this success, reinforcement learning has not been widely applied to real-world problems including Natural Language Processing (NLP).
As part of my MSc thesis in Data Science, we demonstrate how Deep RL can be used to outperform supervised learning methods in automatically generating paraphrases of input text. The problem of generating the best paraphrase can be viewed as finding the series of words which maximizes the semantic similarity between sentences while maintaining fluency in the output. RL agents are well-suited for finding the best set of actions to achieve the maximum expected reward in control environments.
In contrast with most problems in machine learning, the largest problem in most Natural Language Generation (NLG) applications does not lie in the modelling but rather in the evaluation. While human evaluation is currently considered the gold standard in NLG evaluation, it suffers from significant disadvantages including being expensive, time-consuming, challenging to tune, and lacking reproducibility across experiments and datasets (Han, 2016). As a result, researchers have long been searching for automatic metrics which are simple, generalizable, and which reflect human judgment (Papineni et al., 2002).
The most common automatic evaluation methods in evaluating machine generated image captions are summarized below with their pros and cons:
Paraphrase Generation using Reinforcement Learning Pipeline
We developed a system named ParaPhrasee which generates high quality paraphrases. The system consists of multiple steps in order to apply reinforcement learning in a computationally efficient way. A brief summary of the high-level pipeline is shown below with more detail contained in the thesis.
There are several paraphrase datasets available that are used in research including: the Microsoft Paraphrase corpus, ACL’s Semantic Text Similarity competition, Quora Duplicate Questions, and Twitter Shared Links. We have selected MS-COCO given its size, cleanliness, and use as a benchmark for two notable paraphrase generation papers. MS-COCO contains 120k images of common scenes with 5 image captions per image provided by 5 different human annotators.
While it is primarily designed for computer vision research, the captions tend to have high semantic similarity and are interesting paraphrases. Because the image captions are provided by different people, they tend to vary slightly in the details they describe in the scene. As a result, the generated sentences tend to hallucinate details.
While reinforcement learning has improved considerably in terms of sample efficiency, training times, and overall best practices, training RL models from scratch is still comparatively very slow and unstable (Arulkumaran et al., 2017). Therefore, rather than train from scratch, we first train a supervised model and then fine-tune it using RL.
We use an Encoder-Decoder model framework and evaluate the performance of several baseline supervised models. When fine-tuning the model using RL, we only fine-tune the decoder network and treat the encoder network as static. As such we consider two main frameworks:
- Training the supervised model from scratch using a standard/vanilla encoder decoder with GRUs
- Using pretrained sentence embedding models for the encoder including: pooled word embeddings (GloVe), InferSent, and BERT
The supervised models tend to perform fairly similarly across models with BERT and the vanilla encoder-decoder achieving the best performance.
While the performance tends to be reasonable, there are three common sources of error: stuttering, generating sentence fragments, and hallucinations. These are the main problems that using RL aims to solve.
Reinforcement Learning Model
Implementing RL algorithms is very challenging especially when you don’t know if the problem can be solved. There can be problems in the implementation of your environment, your agents, your hyperparameters, your reward function, or a combination of all of the above! These problems are exacerbated when doing deep RL as you get the fun of the added complexity of debugging neural networks.
As with all debugging, it is crucial to start simple. We implemented variations of two well understood toy RL environments (CartPole and FrozenLake) to test RL algorithms and find a repeatable strategy for transferring knowledge from the supervised model.
We found that using an Actor-Critic algorithm outperformed REINFORCE in these environments. In terms of transferring knowledge to the actor-critic model, we found that initializing the actor’s weights with the trained supervised model and pretraining the critic achieved the best performance. We found it challenging to generalize sophisticated policy distillation approaches to new environments as they introduce many new hyperparameters which require tuning to work.
Supported by these insights, we then turn to developing an approach for the paraphrase generation task. We first need to create an environment.
The environment allows us to easily test the impact of using different evaluation metrics as reward functions.
We then define the agent. Given its many advantages, we use an actor-critic architecture. The actor is used to select the next word in the sequence and has its weights initialized using the supervised model. The critic provides an estimate of the expected reward a state is likely to receive to help the actor learn.
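As a rough illustration of how a critic's estimate can help the actor learn, the sketch below computes a one-step advantage, a common textbook form of the actor-critic learning signal. The post does not state the exact update used in the thesis, so the function name `td_advantage` and the discount `gamma` are assumptions made for illustration.

```python
def td_advantage(reward, value_state, value_next, gamma=0.99, done=False):
    """One-step advantage: how much better the outcome was than the critic expected.

    reward      -- reward received after choosing the next word
    value_state -- critic's estimate for the state before the choice
    value_next  -- critic's estimate for the state after the choice
    done        -- True when the sentence has ended (no future reward)
    """
    target = reward + (0.0 if done else gamma * value_next)
    return target - value_state

# A positive advantage pushes the actor towards the chosen word,
# a negative one pushes it away.
print(td_advantage(reward=1.0, value_state=0.5, value_next=0.0, done=True))
```

Using the advantage rather than the raw reward typically lowers the variance of the actor's updates, which is one commonly cited reason actor-critic methods train more stably than plain REINFORCE.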
Designing the Right Reward Function
The most important component of designing an RL system is the reward function, as this is what the RL agent is trying to optimize. If the reward function is incorrect, then the results will suffer even if every other part of the system works!
A classic example of this is CoastRunners, where the OpenAI researchers set the reward function as maximizing the total score rather than winning the race. As a result, the agent discovered a loop where it could get the highest score by hitting turbos without ever completing the race.
Given evaluating the quality of paraphrases is itself an unsolved problem, designing a reward function that automatically captures this objective is even harder. Most aspects of language do not decompose nicely into linear metrics and are task dependent (Novikova et al., 2017).
The RL agent often discovers an interesting strategy to maximize rewards which exploits the weaknesses in the evaluation metric rather than generating high quality text. This tends to result in poor performance on metrics which the agent is not directly optimizing.
We consider three main approaches:
- Word-overlap Metrics
Common NLP evaluation metrics consider the proportion of word overlap between the generated paraphrase and the evaluation sentence. The greater the overlap the greater the reward. The challenge with word-level approaches is the agent includes too many connecting words such as “a is on of” and there is no measure of fluency. This results in very low-quality paraphrases.
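The weakness described above can be shown with a small sketch. It uses clipped unigram precision, which is only one simple word-overlap measure and not necessarily the metric used in the thesis. The example sentences are made up for illustration.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Fraction of candidate words that also appear in the reference.

    Counts are clipped so a word cannot be credited more often
    than it occurs in the reference.
    """
    cand_words = candidate.lower().split()
    if not cand_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for word, count in Counter(cand_words).items():
        matched += min(count, ref_counts[word])
    return matched / len(cand_words)

reference = "a cat is sitting on top of a table"
fluent = "a cat sits on a table"      # readable paraphrase
degenerate = "a is on of"             # only connecting words

print(unigram_precision(fluent, reference))
print(unigram_precision(degenerate, reference))
```

Here the string of connecting words scores higher than the readable paraphrase, because every one of its words appears in the reference. An agent rewarded on this number alone has no reason to produce a fluent sentence.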
- Sentence-level Similarity and Fluency Metrics
The main properties of a generated paraphrase are that it must be fluent and semantically similar to the input sentence. Therefore, we try to explicitly score these individually then combine the metrics. For semantic similarity, we use the cosine similarity between sentence embeddings from pretrained models including BERT. For fluency, we use a score based on the perplexity of a sentence from GPT-2. The greater the cosine similarity and fluency scores the greater the reward.
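A minimal sketch of such a combined reward is given below. The sentence embeddings and per-token log-probabilities are passed in as plain lists, standing in for the outputs of models such as BERT and GPT-2. The post does not specify how perplexity is mapped to a fluency score or how the two scores are combined, so the `1 / perplexity` mapping and the `weight` parameter are assumptions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def perplexity(token_log_probs):
    """exp of the mean negative log-probability per token (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def fluency_score(token_log_probs):
    """Assumed mapping: lower perplexity gives a score closer to 1."""
    return 1.0 / perplexity(token_log_probs)

def reward(input_vec, output_vec, token_log_probs, weight=0.5):
    """Weighted mix of semantic similarity and fluency (weight is an assumption)."""
    similarity = cosine_similarity(input_vec, output_vec)
    return weight * similarity + (1.0 - weight) * fluency_score(token_log_probs)

# Placeholder inputs: identical embeddings and four tokens with probability 0.5 each.
print(reward([1.0, 0.0], [1.0, 0.0], [math.log(0.5)] * 4))
```

The single `weight` parameter makes the balancing problem visible: any fixed value trades detail against fluency, and the agent will lean towards whichever term is easier to increase.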
We tried many different combinations of sentence embedding models and fluency models and while the performance was reasonable, the main issue the agent faced was not sufficiently balancing semantic similarity with fluency. For most configurations, the agent prioritized fluency resulting in removing detail and most entities being placed “in the middle” of something or being moved “on a table” or “side of the road”.
Multi-objective reinforcement learning is an open research question and is very challenging in this case.
- Using an Adversarial Model as a Reward Function
Given humans are considered the gold standard in evaluation, we train a separate model called the discriminator to predict whether or not two sentences are paraphrases of one another (similar to the way a human would evaluate). The goal of the RL model is then to convince this model that the generated sentence is a paraphrase of the input. The discriminator generates a score of how likely the two sentences are to be paraphrases of one another which is used as the reward to train the agent.
Every 5,000 guesses the discriminator is told which paraphrase came from the dataset and which was generated so it can improve its future guesses. The process continues for several rounds with the agent trying to fool the discriminator and the discriminator trying to differentiate between the generated paraphrases and the evaluation paraphrases from the dataset.
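The alternation between the agent and the discriminator can be sketched as a simple schedule. Everything here is hypothetical scaffolding: the function name and the four callables are placeholders for the real models, and only the idea of updating the discriminator after a fixed number of guesses comes from the description above.

```python
def adversarial_rounds(pairs, generate, score, update_agent,
                       update_discriminator, update_every=5000):
    """Alternate between rewarding the agent and retraining the discriminator.

    pairs                -- iterable of (input_sentence, dataset_paraphrase)
    generate(src)        -- returns the agent's paraphrase of src
    score(src, cand)     -- discriminator's paraphrase probability, used as reward
    update_agent(...)    -- applies one RL update to the agent
    update_discriminator -- retrains on a list of (src, sentence, label) examples
    """
    batch = []
    for source, human_paraphrase in pairs:
        candidate = generate(source)
        reward = score(source, candidate)
        update_agent(source, candidate, reward)
        # Keep both versions with their true origin for the next reveal.
        batch.append((source, human_paraphrase, 1))  # 1 = from the dataset
        batch.append((source, candidate, 0))         # 0 = generated
        if len(batch) >= 2 * update_every:
            update_discriminator(batch)
            batch = []
```

Between reveals the discriminator is frozen, so the agent is always optimizing against a fixed target. Each reveal then raises the bar for the next round.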
After several rounds of training, the agent generates paraphrases that outperform the supervised models and other reward functions.
Conclusion and Limitations
Adversarial approaches (including self-play for games) provide an extremely promising approach for training RL algorithms to exceed human level performance on certain tasks without defining an explicit reward function.
While RL was able to outperform supervised learning in this instance, the amount of extra overhead in terms of code, computation, and complexity is not worth the performance gain for most applications. RL is best left to situations where supervised learning cannot be easily applied, and a reward function is easy to define (such as Atari games). The approaches and algorithms are far more mature in supervised learning and the error signal is much stronger which results in much faster and more stable training.
Another consideration is, as with other neural approaches, that the agent can fail very dramatically in cases where the input is different from the inputs it has previously seen, requiring an additional layer of sanity checks for production applications.
The explosion of interest in RL approaches and advances in computational infrastructure in the last few years will unlock huge opportunities for applying RL in industry, especially within NLP.
OpenAI to Release Commercial Version of Famous Text Generation Model
The artificial intelligence research lab OpenAI will be making its famous text generation tool GPT-3 available for purchase, marking the first commercial product produced by the machine learning non-profit.
In February of last year, OpenAI made their text generation model GPT-2 open source. GPT-2 was able to generate impressively coherent text when provided with a short prompt, and the model was regarded as a significant step forward in the fields of natural language processing and artificial intelligence. In fact, OpenAI initially refused to release the GPT-2 model, claiming that it was too dangerous for an open-source release, fearing misuse. However, after a version of the model had been available for some time, and the nonprofit reported that they hadn’t seen any evidence of malicious usage, a decision was made to make the model open source. Just recently, OpenAI announced the follow-up to GPT-2, dubbed GPT-3.
The GPT-3 model is approximately 100 times larger than its predecessor, and it’s this model that they are offering as a commercial product. The GPT-2 model was comprised of approximately 1.5 billion parameters while GPT-3 is comprised of approximately 175 billion parameters. The sophistication and reliability of the GPT model series have caused the models to converge on becoming the standard for machine learning projects involving text. Much like how convolutional neural networks have become the default model for image-related projects, text related projects are increasingly utilizing GPT-2, and perhaps soon GPT-3.
As reported by The Verge, OpenAI seems as if they are hoping to drive the adoption of a GPT standard forward by making their GPT-3 model commercially available. Currently, access to the GPT-3 API is invite-only. However, this is likely just a test run and it seems probable that GPT-3 will be made more widely available sooner rather than later. OpenAI has stated that it will be vetting customers of their models to prevent the abuse of technology for unethical uses like spamming, creating misinformation, or harassment.
It’s somewhat unclear exactly how GPT-3 is intended to be used by customers. This is arguably because GPT-3 is so flexible and has so many potential applications. GPT-based models can be fine-tuned as necessary to generate text appropriate for specific tasks. The model is able to adapt to a wide variety of prompts and input styles, meaning that it could be used to create text summarization tools, answer simple questions, or act as the basis for a sophisticated chatbot. As OpenAI’s chief technology officer Greg Brockman explained to Wired:
“The big mental shift is, it’s much more like talking to a human than formatting things for a machine,” says Greg Brockman, OpenAI’s chief technology officer. “You give it a few questions and answers and suddenly it’s in Q&A mode.”
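The "few questions and answers" Brockman describes can be pictured as a plain-text prompt that ends with an unanswered question. The sketch below only builds such a string. The `Q:`/`A:` layout and the function name `build_qa_prompt` are assumptions for illustration, and no call to any OpenAI API is shown.

```python
def build_qa_prompt(examples, question):
    """Format example question/answer pairs followed by a new, unanswered question."""
    lines = []
    for example_question, example_answer in examples:
        lines.append(f"Q: {example_question}")
        lines.append(f"A: {example_answer}")
    lines.append(f"Q: {question}")
    lines.append("A:")  # left open for the model to complete
    return "\n".join(lines)

# Made-up example pairs; any short factual questions would do.
examples = [
    ("What is the capital of France?", "Paris"),
    ("How many days are in a week?", "Seven"),
]
print(build_qa_prompt(examples, "What color is the sky on a clear day?"))
```

The examples set the pattern, and a text generation model that continues the string is nudged towards supplying the missing answer in the same style.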
It’s unknown just how reliable the GPT-3 derived commercial models will be. While GPT-3 can generate syntactically correct, natural-sounding language, it lacks an intuitive understanding of the real world and how concepts relate to one another. This could prove to be problematic for uses where preserving context is especially important, such as chatbots designed for customer support.
Joe Davison, a research engineer at Hugging Face, expressed to Digital Trends that the sheer size of GPT-3 could limit its potential uses. Davison argued that while OpenAI has meaningfully advanced the state of the art when it comes to text generation, demonstrating that creating general-purpose models can reduce the need for task-specific data, the computation resources required to make use of GPT-3 make it fairly impractical for many companies.
OpenAI was previously a non-profit enterprise, but since 2019 the lab has shifted to a for-profit model. GPT-3 is the first major commercial product produced by the lab, and it remains to be seen if the switch to a for-profit model will shift OpenAI’s research priorities.
Akilesh Bapu, Founder & CEO of DeepScribe – Interview Series
Akilesh Bapu is the Founder & CEO of DeepScribe, which uses natural language processing (NLP) and advanced deep learning to generate accurate, compliant, and secure notes of doctor-patient conversations.
What was it that introduced and attracted you to AI and natural language processing?
If I remember correctly, Jarvis from “Iron Man” was the first thing that really attracted me to the world of natural language processing and AI. Particularly, I found it fascinating how much faster a human was able to not only go through tasks but also go into an incredible level of depth into certain tasks and unveil certain information that they wouldn’t have even known about if it weren’t for this AI.
It was this concept of “AI by itself won’t be as good as humans at most tasks but put a human and AI together and that combination will dominate.” Natural language processing is the most efficient way for this human/AI combination to happen.
From then on, I was obsessed with Siri, Google Now, Alexa, and the others. While they didn’t work as seamlessly as Jarvis, I so badly wanted to make them work as Jarvis did. Particularly, what became apparent was, commands such as “Alexa do this,” “Alexa do that,” were pretty easy and accurate to do with the current state of technology. But when it comes to something like Jarvis, where it can actually learn and understand, filter, and pick up on important topics during another conversational exchange—that hadn’t really been done before. This actually directly relates to one of my core motivations in founding DeepScribe. While we are solving the issue of documentation for physicians, we’re attempting a whole new wave of intelligence while doing it: ambient intelligence. AI that can dig through your day-to-day utterances, find useful information, and use that information to help you out.
You previously did some research using deep learning and NLP at UC Berkeley College of Engineering. What was your research on?
Back at the Berkeley AI Research Lab, I was working on a gene ontology annotator project where we were summarizing PubMed articles with specific output parameters.
The high-level overview: Take a task like the CNN news article summarization. In that task you’re taking news articles and summarizing them into roughly a few sentences. In your favor you have data and the ability to train these models on over a million articles. However, the problem space is enormous since you have limited structure to the summaries. In addition, there is hardly any structure to the actual articles. While there have been quite a few improvements since 2.5 years ago when I was working on this project, this is still an unsolved problem.
In our research project, however, we were developing structured summaries of articles. A structured summary in this case is similar to a typical summary except we know the exact structure of the output summary. This is helpful since it dramatically reduces the output options for our machine learning model—the challenge was that there was not enough annotated training to run a data-hungry deep learning model and get usable results.
The core of the work I did on this project was to leverage the knowledge we have around the input data and develop an ensemble of shallow ML models to support it—a technique we invented called the 2-step annotator. The 2-step annotator benchmarked at nearly 20x the accuracy as the previous best (54 percent vs 3.6 percent).
While side by side, this project and DeepScribe may sound entirely different, they were highly similar in how they used the 2-step annotation method to vastly improve results on a limited dataset.
What was the inspiration behind launching DeepScribe?
It all started with my father, who was a medical oncologist. Before electronic health record systems took over health care, physicians would jot down things on paper and spend very little time on notes. However, once EHRs started becoming popular as part of the HITECH Act of 2009, I started noticing that my dad spent more and more time at the computer. He’d start coming home later. On the weekends, he’d be sitting on the couch dictating notes. Simple things like him picking me up from school or basketball practice became a thing of the past as he’d be spending most of his evening hours catching up on documentation.
As a nerdy kid growing up, I would try to find solutions for him by searching the web and having him try them out. Sadly, nothing worked well enough to save him from the long hours of documentation.
Fast forward several years to the summer of 2017—I’m a researcher working at the Berkeley AI Research Lab, working on projects in document summarization. One summer when I’m back at home, I notice that my dad is still spending copious amounts of time documenting. I ask, “What’s new in the world of documentation? Alexa is everywhere, Google Assistant is so good now. Tell me, what’s the latest in the medical space?” And his answer was, “Nothing has changed.” I thought that it was just him but when I went and surveyed several of his colleagues, it was the same issue: not what the latest is in cancer treatment or the novel problems their patients were having—it was documentation. “How can I get rid of documentation? How can I save time on documentation? It’s taking so much of my time.”
I also noticed several companies that had emerged to try to solve documentation. However, either they were too expensive (thousands of dollars per month) or they were too minimal in terms of technology. The physicians at that time had very few options. That was when the opportunity opened up that if we could create an artificially intelligent medical scribe, a technology that could follow physicians’ patient visits and summarize them, and offer it at a cost that could make it accessible for everyone, it could truly bring the joy of care back to medicine.
You were only 22 years old when you launched DeepScribe. Can you describe your journey as an entrepreneur?
At Berkeley, I continued to delve into the world of entrepreneurship as much as possible, primarily with their wide array of classes. My favorites were:
- The Newton Lecture Series—people like Jessica Mah from InDinero or Diane Greene from VMWare who were Cal alums gave highly relatable talks about their time at Berkeley and how they started their own companies
- Challenge Lab—I actually met my co-founder Matt Ko through this class. We were placed in groups and went through a semester-long journey of creating a product and being mentored on what it takes during the early stages to get an idea going.
- Lean Launchpad—By far my favorite of the three; this was a grueling and rigorous process where we were guided by Steve Blank (acclaimed billionaire and the man behind the lean startup movement) to take an idea, validate it through 100 customer interviews, build a financial model, and more. This was the type of class where we pitched our “startup” only to get stopped on slide 1 or 2 and get grilled. If that wasn’t hard enough, we were also expected to interview 10 customers a week. Our idea at the time was to create a patent search that would give similar results to an expensive prior art search, which meant we were pitching to 10 enterprise customers a week. It was great because it taught us to think fast on our feet and be extra resourceful.
DeepScribe started when an investor group called The House Fund was writing checks for students who would turn down their summer internships and spend their summer building their company. We had just shut down Delphi (the patent search engine) and Matt and I had been constantly talking about medical documentation and everything fell in place since it was the perfect time to give it a shot.
With DeepScribe, we were lucky to have just come fresh out of Lean Launchpad since one of the most important factors in building a product for physicians was to iterate and refine the product around customer feedback. A historical issue with the medical industry has been that software has rarely had physicians in the design loop, therefore resulting in software that wasn’t optimized for the end user.
Since DeepScribe was happening at the same time as my final year at Berkeley, it was a heavy balancing act. I’d show up to class in a suit so I could be on time for a customer demo right after. I’d use all the EE facilities and professors not for anything to do with class but 100 percent for DeepScribe. My meetings with my research mentor even turned into DeepScribe brainstorming sessions.
Looking back, if I had to change one thing about my journey, it would’ve been to put college on hold so I could spend 150 percent of my time on DeepScribe.
Can you describe for a medical professional what the advantages of using DeepScribe are versus the more traditional method of voice dictation or even taking notes?
Using DeepScribe is meant to be very similar to using an actual human scribe. As you talk naturally to your patient, DeepScribe will listen in, pick up on the medically relevant speech that usually goes in your notes, and put it in there for you, using the same medical language that you yourself use. We like to think of it as a new AI-powered member of your medical staff that you can train as you’d like to help with documentation in your electronic health record system. It’s very different from using a voice dictation service, as it eliminates the entire step of having to go back and document. While typical dictation services turn 10 minutes of documentation into 7-8 minutes, DeepScribe turns it into a few seconds. Our physicians report anywhere from 1.5 to 3 hours of time saved per day depending on how many patients they see.
DeepScribe is device-agnostic, operable from an iPhone, Apple Watch, browser (for telemedicine), or hardware device.
What are some of the speech recognition or NLP challenges that DeepScribe may encounter due to complex medical terminology?
Contrary to popular opinion, complex medical terminology is actually the easiest part for DeepScribe to pick up. The trickiest part for DeepScribe is to pick up on unique contextual statements a patient may give a physician. The more they stray from a typical conversation, the more we see the AI stumble. But as we collect more conversational data, we see it improve on this dramatically every day.
What are the other machine learning technologies that are used with DeepScribe?
The large umbrellas of speech recognition and NLP tend to cover most of the machine learning we’re doing at DeepScribe.
Can you name some of the hospitals, nonprofits, or academic institutions that are using DeepScribe?
DeepScribe started out through a pilot program with the UC Berkeley Health Center. Hartford Healthcare, Texas Medical Center, and Cedar Valley Medical Specialists are a handful of the larger systems DeepScribe is working with.
However, the larger percentage of DeepScribe users are 50 private practices from Alaska to Florida. Our most popular specialties are primary care, orthopedics, gastroenterology, cardiology, psychiatry, and oncology, but we do support a handful of other specialties.
DeepScribe has recently launched a program to assist with COVID-19. Could you walk us through this program?
COVID-19 has hit our doctors hard. Practices are only seeing 30-40 percent of their patient load, scribe staffing is being cut, and providers are being forced to rapidly switch all their patients on to telemedicine. All this ends up leading to more clerical work for providers—we at DeepScribe firmly believe that in order for this pandemic to come to a halt, physicians must devote 100 percent of their attention and time to taking care of their patients.
To help aid this cause, we are proud to launch a free telemedicine solution to health care professionals fighting this pandemic. Our telemedicine solution is fully integrated with our AI-powered medical scribe solution, eliminating the need for clinical documentation for encounters made on our platform.
We’re also offering our scribe service for free during the pandemic. This means that any physician can get access to a scribe for free to handle their documentation. Our hopes are that by doing this, physicians will be able to focus more of their attention on their patients and spend less time thinking about documentation, leading to a faster halting of the COVID-19 outbreak.
Thank you for the great interview, I really enjoyed learning about DeepScribe and your entrepreneurial journey. Anyone who wishes to learn more should visit DeepScribe.