Facebook has recently developed a new machine translation model that can translate text between any pair of languages drawn from a set of 100 languages. Most existing AI translation systems work by first translating text into English and then translating from English into the target language. As Engadget reported, Facebook’s AI translator operates without using English as a middleman, and is reportedly able to achieve approximately 90% accuracy.
Facebook’s training data for the AI model was composed of around 7.5 billion pairs of sentences, distributed across 100 different languages. The data was compiled from the web using a series of web crawlers, and the languages present in the collected data were identified using a language identification model called FastText. Once the data was collected, it was run through a tool called LASER 2.0 to extract the meaning of the different sentence samples and match sentences in different languages together based on their meaning. LASER 2.0 was developed by Facebook, and it employs unsupervised learning algorithms to create sentence embeddings. These embeddings contain information about the relationships between different sentences, based on features like frequency of use and how near sentences appear to each other. LASER 2.0 is then able to create pairs of sentences that have highly similar meanings.
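The matching step described above can be illustrated with a minimal Python sketch. It is not Facebook's pipeline and does not call LASER or FastText: the three-dimensional vectors are hand-made toy embeddings, the function names and the 0.9 threshold are assumptions chosen for illustration, and a plain cosine-similarity cutoff stands in for whatever scoring rule the real system uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_pairs(src, tgt, threshold=0.9):
    """Pair each source sentence with its most similar target sentence.

    src and tgt map sentence text to an embedding vector. A pair is kept
    only if its similarity reaches the (assumed) threshold.
    """
    pairs = []
    for s_text, s_vec in src.items():
        best_text, best_score = None, -1.0
        for t_text, t_vec in tgt.items():
            score = cosine(s_vec, t_vec)
            if score > best_score:
                best_text, best_score = t_text, score
        if best_text is not None and best_score >= threshold:
            pairs.append((s_text, best_text, best_score))
    return pairs


# Toy embeddings, invented for this example. A real system would obtain
# them from a multilingual sentence encoder.
FRENCH = {
    "Le chat dort.": [0.90, 0.10, 0.00],
    "Il pleut.": [0.00, 0.20, 0.95],
}
GERMAN = {
    "Die Katze schläft.": [0.88, 0.12, 0.02],
    "Es regnet.": [0.01, 0.18, 0.97],
    "Guten Morgen.": [0.30, 0.90, 0.10],
}

if __name__ == "__main__":
    for s_text, t_text, score in mine_pairs(FRENCH, GERMAN):
        print(f"{s_text} <-> {t_text} ({score:.3f})")
```

Note that the French and German sentences are matched directly, with no English text involved at any point, which is the property the training data was built to have.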
The training data wasn’t just paired based on sentence meanings. Languages themselves were grouped together. The goal was to design a system that didn’t require English to be used as a medium between two languages, with Facebook’s Angela Fan, who led the project, noting that many regions around the globe speak two languages that aren’t English. The Facebook engineers carried out training by focusing on pairing languages that are commonly translated to and from each other. Fourteen different language groups were created, based upon variables like culture, linguistic similarities, and geography. As an example, one of the linguistic groups created by the researchers contained the most common languages throughout India, which include the languages Urdu, Tamil, Hindi, and Bengali. This was done so that commonly paired languages would receive high-quality translations.
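As a rough illustration of how grouping narrows the set of language pairs to focus on, the Python sketch below enumerates the pairs inside each group. Only the India group and its four languages come from the description above; the second group, the group names, and the function name are hypothetical, and the sketch says nothing about how Facebook weighted or bridged pairs across groups.

```python
from itertools import combinations

# The "india" entry lists the languages named in the article. The second
# group is a hypothetical placeholder added so the example has more than
# one group.
LANGUAGE_GROUPS = {
    "india": ["Urdu", "Tamil", "Hindi", "Bengali"],
    "iberian_hypothetical": ["Spanish", "Portuguese"],
}


def within_group_pairs(groups):
    """Return every unordered language pair that shares a group."""
    pairs = set()
    for languages in groups.values():
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(languages), 2):
            pairs.add((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    for a, b in within_group_pairs(LANGUAGE_GROUPS):
        print(f"{a} <-> {b}")
```

With four languages in one group and two in the other, this yields seven pairs, compared with fifteen if all six languages were paired with each other indiscriminately.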
The language-group-focused training method led to some interesting results. It was found that the resulting translation model had greater accuracy than currently existing models for certain language pairings. When translating between English and Belarusian, for example, the AI was able to apply certain patterns it had learned when translating Russian because Belarusian has linguistic similarities with Russian. Similarly, translation efforts between Spanish and Portuguese improved since Spanish is the second most widely spoken language and there was a substantial volume of training data for the task.
There are approximately sixty languages that the translation system doesn’t cover yet, and the model’s accuracy on languages without a lot of training data needs to be improved before it is ready for use. Many languages across Southeast Asia and Africa lack the volume of data needed to train a reliable model. The research team will need to determine some way of compensating for this lack of data. The research team also needs to determine how to control for any racist, sexist, or otherwise profane patterns the model might have learned. While the research team has made use of a profanity filter, the filter works mainly on the English data.
The machine translation system hasn’t been employed on Facebook’s social media platform yet. The current model is for research purposes only. However, Facebook is gearing up to design similar models and have them handle the approximately 20 billion translation requests the site receives every day.