Nitin Madnani is a Senior Research Scientist with the Natural Language Processing (NLP) research group at the Educational Testing Service (ETS). Founded in 1947, ETS is the world’s largest private nonprofit educational testing and assessment organization.
Could you begin by explaining what the mission of ETS is?
ETS’s mission is to advance quality and equity in education for all learners worldwide. This mission underlies our products, services, research, and development efforts, which aim to promote learning, support education and professional development, and measure knowledge and skills for everyone.
We believe that anyone, anywhere can make a difference in their lives through learning, and that ETS’s work on research, assessment, measurement, and policy can play an important role in making that learning possible.
What is it about NLP that has you so passionate?
All human languages are so beautifully complex and messy. They allow us to express a range of emotions in our speech and even in our writing and they evolve with time. On the other hand, a computer is so deterministic and clinical in processing its inputs. Natural Language Processing (NLP) is an area of artificial intelligence that tries to make this supremely non-human device understand the beautiful complexities of human language by combining techniques from Computer Science, Linguistics, and Statistics. How could you not find this fascinating?
ETS NLP & speech scientists have recently developed RSMTool. Could you share with us what the RSMTool does?
As we have seen in the past few years, all machine learning models can potentially exhibit biased behavior regardless of the field in which they’re applied, education being no exception. The automated grading systems used to assign scores or grades to students’ speech or essays in tests or in classrooms often use machine learning models. Therefore, it is absolutely possible for such systems to behave in a biased manner. Such bias can have serious consequences especially if the scores from such systems are used to make high-stakes decisions.
RSMTool is an open-source tool that my colleague Anastassia Loukina (previously featured on Unite.AI) and I developed at ETS to help ensure that any systematic, harmful biases in automated grading systems are identified as early as possible, hopefully even before the systems are deployed in the real world. RSMTool is designed to provide a comprehensive evaluation of AI scoring engines including not only standard metrics of prediction accuracy, but also measures of model fairness, and metrics based on test theory, helping developers of such engines identify possible biases or other problems in their systems.
Where does the name RSMTool come from?
In the educational assessment field, someone who assigns a score to (or “rates”) an essay is often referred to as a “rater.” There are human raters as well as automated raters. RSMTool – short for Rater Scoring Modeling Tool – is designed to help build (and evaluate) the scoring models used by automated raters.
How can this tool assist developers to identify possible bias or other problems in their AI scoring engines?
In the last five decades, educational measurement scientists – including many of our colleagues at ETS – have conducted valuable research on what makes automated (and human) scoring fair. As part of this research, they have developed many statistical and psychometric analyses for computing indicators of systematic bias. However, since the psychometric and NLP communities seldom interact, there is little opportunity for cross-pollination of ideas. The upshot is that NLP researchers and developers who are building actual automated scoring systems – especially individual researchers and those in small companies – do not have easy access to the psychometric analyses they should be using to check their systems for bias. RSMTool attempts to solve this problem by providing a large, diverse set of psychometric analyses in a single, easy-to-use Python package that can be easily incorporated by any NLP researcher into their research or operational pipeline.
In a typical use case, a researcher provides as input a file or a data frame containing the numeric system scores, the gold-standard (human) scores, and any applicable metadata. RSMTool processes this data and generates an HTML report containing a comprehensive evaluation that includes descriptive statistics as well as multiple measures of system performance and fairness, among others. A sample RSMTool report can be found at https://bit.ly/fair-tool. RSMTool can work with traditional feature-driven machine learning models (e.g., from the scikit-learn library) and with deep learning models. Although the primary output of RSMTool is the HTML report, which makes for easier sharing, it also generates tabular data files (in CSV, TSV, or XLSX formats) as intermediate outputs for more advanced users. Finally, to keep things extremely customizable, RSMTool implements each section of its report as a Jupyter notebook, so users can not only choose which sections are relevant to their specific scoring models but also easily implement custom analyses and include them in the report with very little work.
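To make the shape of that input concrete, here is a minimal sketch of the kind of data described above and two of the simplest statistics one might compute from it. It is not RSMTool's API: the field names (`system`, `human`, `group`), the sample values, and the `evaluate` helper are all hypothetical, and the real tool reports far more than this.

```python
import math

# Hypothetical input rows: one per scored response, with a system score,
# a gold-standard human score, and optional metadata (a made-up "group").
responses = [
    {"id": "r1", "system": 3.2, "human": 3, "group": "A"},
    {"id": "r2", "system": 4.1, "human": 4, "group": "B"},
    {"id": "r3", "system": 2.4, "human": 2, "group": "A"},
    {"id": "r4", "system": 4.6, "human": 5, "group": "B"},
    {"id": "r5", "system": 1.9, "human": 2, "group": "A"},
]


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate(rows):
    """Return a few descriptive and agreement statistics for the rows."""
    system = [row["system"] for row in rows]
    human = [row["human"] for row in rows]
    return {
        "n": len(rows),
        "mean_system": sum(system) / len(system),
        "mean_human": sum(human) / len(human),
        "pearson_r": pearson(system, human),
    }


summary = evaluate(responses)
for name, value in summary.items():
    print(f"{name}: {value:.3f}" if isinstance(value, float) else f"{name}: {value}")
```

The sketch deliberately uses only the standard library; in practice the same rows would more likely live in a CSV file or a data frame.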
What are the common types of bias that may impact automated scoring systems?
The most common type of bias affecting an automated scoring system is differential sub-group performance, i.e., when the automated system performs differently for different population sub-groups. For example, a biased scoring system could produce systematically lower scores for essays written by Black females than for those written by White males, even though a human would see no systematic differences in the actual writing skills displayed by those two sub-groups in their essays.
ETS has a rich history of conducting research on fairness for automated scoring engines. For example, ETS researchers have looked at whether e-rater®, our AI automated scoring engine, exhibits any differential performance for subgroups defined by ethnicity, gender, and country (they found some minor differences that were addressed by subsequent policy changes). Studies have also looked at whether e-rater® treats responses written by GRE® test-takers with learning disabilities and/or ADHD systematically differently on average (it does not). Most recently, a timely study looked at whether an automated system for scoring speaking proficiency exhibits any systematic bias towards test-takers who were required to wear face masks versus those who did not wear face masks (it does not). RSMTool contains several psychometric analyses that attempt to quantify differential subgroup performance over subgroups that the user can define over their own data.
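As a rough illustration of what "differential subgroup performance" means numerically, the sketch below computes, for each subgroup, the average gap between system and human scores, scaled by the overall spread of the human scores. This is a simplified stand-in chosen for illustration, not one of the specific psychometric analyses RSMTool implements; the function name, field names, and data are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def subgroup_gaps(rows, group_key="group"):
    """Mean (system - human) difference per subgroup, in human-score SD units.

    A value near zero means the system tracks human scores for that
    subgroup on average; a clearly negative value means the system
    under-scores that subgroup relative to human raters.
    """
    human_sd = pstdev([row["human"] for row in rows])
    if human_sd == 0:
        raise ValueError("human scores have no variance; gap is undefined")

    diffs_by_group = defaultdict(list)
    for row in rows:
        diffs_by_group[row[group_key]].append(row["system"] - row["human"])

    return {group: mean(diffs) / human_sd for group, diffs in diffs_by_group.items()}


# Made-up data: the system matches humans for group "A" but scores
# group "B" one point lower than the human raters did.
rows = [
    {"system": 2, "human": 2, "group": "A"},
    {"system": 4, "human": 4, "group": "A"},
    {"system": 1, "human": 2, "group": "B"},
    {"system": 3, "human": 4, "group": "B"},
]

gaps = subgroup_gaps(rows)
for group, gap in sorted(gaps.items()):
    print(f"group {group}: standardized gap = {gap:+.2f}")
```

A real fairness evaluation would also account for sample sizes and differences in true proficiency between subgroups, which this sketch ignores.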
ETS chose to make RSMTool open-source. Could you explain the reasoning and importance behind this?
Yes, RSMTool is available on GitHub with an Apache 2.0 license. We believe that it is important for such a tool to be open-source and non-proprietary so that the community can (a) audit the source code of the already available analyses to ensure their compliance with fairness standards and (b) contribute new analyses as the standards evolve and change. We also want to make it easy for NLP researchers and developers to use RSMTool in their work and to help us in making it better. Making RSMTool open-source is a clear example of ETS’s continued commitment to the responsible use of AI in education.
What are some of the lessons you learned from developing & maintaining RSMTool?
Over the last five years that Anastassia and I have developed and maintained RSMTool, with the help of many ETS colleagues and non-ETS GitHub contributors, we have learned two overarching lessons. The first is that different users have different needs, and a one-size-fits-all approach will not work for cross-disciplinary software like RSMTool. The second is that, to make it more likely for open-source software to be adopted, you really have to go the extra mile to make it as robust as possible.
In our tenure as RSMTool maintainers, we have identified many types of users of RSMTool. Some of them are “power users” (e.g., NLP researchers and developers) who want to pick and choose specific RSMTool functionality to plug into their own machine learning pipeline while also using other Python packages. To satisfy such users, we ended up creating a pretty comprehensive API to expose various pre- and post-processing functions as well as custom metrics contained in RSMTool. Another group of users are what we call “minimalists”: data analysts and engineers who may lack the statistical or programming background to interact with the API and prefer an out-of-the-box pipeline instead. To satisfy such users, we have created command-line tools that can easily be called in wrapper shell scripts, for example. We have also found that minimalist users are often reluctant to read through the (admittedly large) list of RSMTool configuration options. Therefore, we built an interactive configuration generator with autocompletion that can help such users create configuration files based on their specific needs.
In order to meet the needs of all our user groups, we have had to adopt practices that we believed were necessary to make RSMTool robust. What do we mean by robust software? To be robust, any piece of software must meet the following criteria: the impact of any code change on its accuracy and performance can be measured (well-tested), its documentation is always up-to-date (well-documented), and the software (along with its dependencies) is easily installable by users. For RSMTool, we have leveraged several open-source tools and services to make it meet our definition. We have a comprehensive test suite (>90% code coverage) that we automatically run via continuous integration for any and all changes submitted to the code. We maintain extensive documentation (including multiple real-world tutorials), and any new functionality proposed for RSMTool must include a documentation component that is also reviewed as part of the code review. Finally, we release RSMTool as packages that can be easily installed (via either pip or conda), and all required dependencies are installed automatically as well.
What does ETS hope to achieve by releasing the RSMTool?
The education sector has seen one of the most significant expansions of AI over the last few years, with automated scoring of text and speech becoming an increasingly common application of NLP. ETS has long been a leader in the field of automated scoring and, since its inception, has been committed to building fair products and assessments that are designed to serve learners worldwide. By releasing RSMTool, developed in close collaboration between NLP scientists and psychometricians, ETS wants to continue its advocacy for the responsible use of AI in education in a very tangible way; specifically, we want to make it clear that when AI researchers think about the “performance” of an automated scoring system, they should consider not only the standard metrics of prediction accuracy (e.g., Pearson’s correlation) but also those of model fairness. More broadly, we would also like RSMTool to serve as an example of ways in which NLP researchers and psychometricians can and should work together.
Is there anything else that you would like to share about the RSMTool?
We want to encourage readers to help us improve RSMTool! They do not need to be a psychometrician or an NLP expert to contribute. We have many open issues related to documentation and to Python programming that would be perfect for any beginner to intermediate Python programmer. We also invite contributions to SKLL (Scikit-Learn Laboratory), another ETS open-source package for efficiently running user-configurable, batched machine learning experiments, which RSMTool uses under the hood.