Appen Limited Launches Diverse Data Training Sets for NLP

Appen Limited, a leading provider of high-quality training data for companies looking to build AI systems at scale, is launching new diverse training datasets for natural language processing (NLP) initiatives. The datasets are intended to give end users the same experience regardless of language variety, dialect, ethnolect, accent, race or gender.

According to a study published in PNAS in March 2020, popular automated speech recognition (ASR) systems, especially those used for virtual assistants, closed captioning, and hands-free computing, often exhibit racial disparities in performance. Much of this stems from the systems being trained on biased or incomplete data, which is why developing diverse training sets is so crucial.

With the new launch, Appen aims to reduce these performance differences and create a more inclusive environment for speech recognition technology. The same types of challenges are present in language interpretation and NLP systems.

Mark Brayan is the CEO of Appen.

“The quality and diversity of training data directly impacts the performance and bias present in AI models,” said Brayan. “As a data partner, we can supply complete training data for many use cases to ensure AI models work for everyone. It’s critical that we engage a diverse group of individuals to produce, label, and validate the data to ensure the model being trained is not only equitable, but also built responsibly.”

Appen Language Projects

Appen is working toward a more diverse AI environment through several projects and partnerships, including:

  • Translators without Borders (TWB) partnership: Appen has partnered with TWB, Amazon, Carnegie Mellon University, Facebook, Google, Johns Hopkins University, Microsoft, and Translated. Together, the partners joined the Translation Initiative for COVID-19 (TICO-19), which aims to expand access to COVID-19 information by supporting the development of language technology in multiple languages. These include languages spoken in developing countries, such as Congolese Swahili, Tigrinya, and Nigerian Fulfulde.

  • Canadian French translation project: Appen helped Microsoft add “Canadian French” as a language option in Microsoft Translator after coordinating with native language consultants.
  • Inuktitut translation project: Appen collaborated with the Government of Nunavut, work that helped lead to Microsoft adding Inuktitut to Microsoft Translator. The Indigenous language is spoken in the Canadian Arctic.

  • African American Vernacular English (AAVE) off-the-shelf datasets: Appen is working with AAVE speakers to collect conversations on a range of topics for an off-the-shelf (OTS) dataset, with the aim of producing training data that represents AAVE.

Dr. Judith Bishop is Senior Director of AI Specialists at Appen.

“Biased AI data leads to projects that can fail to deliver the expected business results and harm individuals they are supposed to benefit,” said Dr. Bishop. “The scale and complexity of AI projects makes it impossible for most companies to acquire sufficient unbiased high-quality data without partnering with an AI data expert. Appen’s commitment to developing the most diverse and expert crowd of data annotators provides the industry with a clearly differentiated resource for building fair and ethical AI projects.”

Appen draws on training data annotators from over 170 countries, who together represent 235 unique languages and 395 dialects. It also offers off-the-shelf (OTS) datasets, which let businesses acquire high-quality training data more quickly for their AI projects.