Researchers at the University of Waterloo have developed an AI model that enables computers to process a wider variety of human languages. This is an important step forward for the field, given how many languages are left behind as language technology is built. African languages often receive little attention from computer scientists, which has limited natural language processing (NLP) capabilities on the continent.
The new language model was developed by a team of researchers at the University of Waterloo’s David R. Cheriton School of Computer Science.
The research was presented at the Multilingual Representation Learning Workshop at the 2021 Conference on Empirical Methods in Natural Language Processing.
The model, called AfriBERTa, helps computers analyze text in African languages for many useful tasks. It uses deep-learning techniques to achieve impressive results for low-resource languages.
Working With 11 African Languages
AfriBERTa currently works with 11 African languages, including Amharic, Hausa, and Swahili, spoken collectively by more than 400 million people. It achieves output quality comparable to the best existing models while learning from only one gigabyte of text. Similar models often require thousands of times more data.
Kelechi Ogueji is a master’s student in computer science at Waterloo.
“Pretrained language models have transformed the way computers process and analyze textual data for tasks ranging from machine translation to question answering,” said Ogueji. “Sadly, African languages have received little attention from the research community.”
“One of the challenges is that neural networks are bewilderingly text- and computer-intensive to build. And unlike English, which has enormous quantities of available text, most of the 7,000 or so languages spoken worldwide can be characterized as low-resource, in that there is a lack of data available to feed data-hungry neural networks.”
Pre-Training Technique
Most of these models rely on a pre-training technique in which the researcher presents the model with text that has some of the words hidden, or masked. The model must guess the hidden words, and it repeats this process billions of times. Eventually it learns the statistical associations between words, which mimic human knowledge of language.
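To make the masking step concrete, here is a minimal Python sketch of how training text might be prepared. It is an illustration only, not the AfriBERTa team's code: the function name mask_tokens, the "[MASK]" placeholder, the whitespace word splitting, and the masking rates are all assumptions made for this example.

```python
import random

MASK = "[MASK]"  # placeholder for a hidden word (assumed name for this sketch)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of words and record what was hidden.

    Returns the masked word list plus a dict mapping each hidden
    position to its original word, which a model would be trained
    to guess. mask_prob is an illustrative default.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    masked = list(tokens)
    targets = {}
    for i, word in enumerate(tokens):
        if rng.random() < mask_prob:
            masked[i] = MASK
            targets[i] = word
    return masked, targets

# Usage: split a sentence on whitespace and hide roughly 30% of its words.
sentence = "the model learns statistical associations between words".split()
masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=1)
print(masked)   # the sentence with some words replaced by [MASK]
print(targets)  # the hidden words the model would have to guess
```

In real pre-training, a neural network would see the masked sentence, predict each hidden word, and adjust its parameters whenever it guessed wrong. The sketch covers only the data preparation, which is the part described above.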
Jimmy Lin is the Cheriton Chair in Computer Science and Ogueji’s advisor.
“Being able to pretrain models that are just as accurate for certain downstream tasks, but using vastly smaller amounts of data has many advantages,” said Lin. “Needing less data to train the language model means that less computation is required and consequently lower carbon emissions associated with operating massive data centres. Smaller datasets also make data curation more practical, which is one approach to reduce the biases present in the models.”
“This work takes a small but important step to bringing natural language processing capabilities to more than 1.3 billion people on the African continent.”
The research also involved Yuxin Zhu, who recently finished an undergraduate degree in computer science at the university.