Name-based demographic ascription tools misgender and misrecognize race and ethnicity
A new study in Nature Human Behaviour calls for more accurate, more ethical tools — and more interesting research.
By Sarah Steimer
Scholars should use caution when relying on name-based demographic ascription tools. New research published in Nature Human Behaviour found substantial inequalities in how these tools misgender people and misrecognize race and ethnicity. The study found that the tools distribute erroneous ascriptions unevenly across groups defined by other demographic traits, and that researchers should be wary of the potential empirical and ethical consequences of these errors.
The research team, led by Jeff Lockhart, a James S. McDonnell Postdoctoral Fellow in the Department of Sociology, evaluated the accuracy of name-based demographic ascription programs. These tools are used across sectors, in market research, by app developers, and elsewhere, including in academic research. Some of the most popular tools for gender imputation have a collective 945 citations on Google Scholar.
To test the tools' accuracy on gender and race, the team surveyed 19,924 authors of social science journal articles. They examined gender and racial misclassification in a trans- and nonbinary-inclusive way, along with nationality, sexuality, disability, parental education, and name changes, combining author names from a publication database that lacks demographic data with original surveys of self-reported demographics. Using the most popular tool, genderize.io, the overall error rate for gender prediction in their sample was 4.6%. However, they found drastic differences in error rates across subgroups.
“The common line is that these methods do very well on average, the error rates are low, etc.,” Lockhart says. “And all of that is true. But what we found when we did a large survey with a bunch of covariates is that the error rates are not remotely even across groups. In some of the more dramatic examples, you see things like 43% of Chinese women are misgendered.”
By definition, automated gender inference was wrong for all 139 nonbinary scholars in the sample, and the algorithm was wrong 3.5 times more often for women than for men. These disparities can bias research results and inferences, and misgendering or misclassifying someone's race or ethnicity can significantly harm individuals, with ethical implications distributed unequally across groups.
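These disparities only become visible when a tool's guesses are checked against self-reports and broken out by subgroup. Below is a minimal sketch of that audit step, assuming a hypothetical table of survey responses with self-reported gender, the tool's imputed gender, and a grouping column; the column names are illustrative and not taken from the study's materials.

```python
# Minimal sketch (not the study's code): compare a name-based tool's guesses
# against self-reported survey data and break the error rate out by subgroup.
# Column names are hypothetical.
import pandas as pd

def error_rates_by_subgroup(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    # True wherever the tool's guess disagrees with the respondent's self-report.
    df = df.assign(error=df["gender_imputed"] != df["gender_self"])
    print(f"Overall error rate: {df['error'].mean():.1%}")  # can look low on average...
    # ...while some subgroups fare far worse, which is the pattern the study reports.
    return (
        df.groupby(group_col)["error"]
        .agg(error_rate="mean", n="size")
        .sort_values("error_rate", ascending=False)
    )

# Example use: error_rates_by_subgroup(survey_df, "country_of_origin")
```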
Lockhart says a common response to the findings is to call for better data and improved technology, but he emphasizes that fundamental cultural processes produce an unevenness that cannot be overcome mathematically. For example, someone may take their spouse's last name, which could carry different ethnic connotations than their maiden name. A Chinese name, when Romanized and written in English, loses the Chinese characters, along with any diacritics or tone marks, that can offer clues to a person's gender.
The authors offer recommendations in their paper for both users and developers of ascription tools. For those using the tools for name-based demographic inference, which of the five principles is most appropriate and practical depends on the nature of the data and the inquiry.
First, in cases where name-based demographic inference may not be theoretically or ethically justified, the authors recommend critical refusal: asking whether this is actually the right or good thing to do. Second, if perceived gender or race/ethnicity is itself of interest in the study, then such measures are warranted, so long as researchers align mechanism with method.
“These name-based tools are measuring the feel of a name; they're measuring vibes,” Lockhart says. “And if what you want to study is discrimination based on names, then you can design studies using these tools.”
Third, inference can be tailored to the researcher's population of interest using domain expertise. Fourth, be cautious: deploy name-based imputation only for subgroups with high accuracy and consistency (a tool that is wrong for 50% of Chinese women is not really an issue if there are no Chinese women in your population). And lastly, name-based demographic estimates are better used as aggregate measures than as individual classifications.
The recommendations for developers focus on improving accuracy and transparency. It is best to report aggregates, for example, because guesses about individual people are often wrong, while averages over large groups come closer to the true values. The researchers also ask developers to report error rates for different subgroups.
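To illustrate the aggregate-over-individual point, here is a hedged sketch that assumes a tool which returns a probability alongside each guess (many such tools do, though the exact output fields vary); the names and numbers are made up.

```python
# Hedged sketch of aggregate use: treat the tool's output as a probability and
# estimate a group-level share instead of hard-labeling each person.
# Names and probabilities below are invented for illustration.
predictions = [
    ("Alex", 0.55),   # hypothetical p(woman) from some name-based tool
    ("Maria", 0.97),
    ("Wei", 0.62),
    ("John", 0.02),
]

# Individual classification: thresholding forces a hard, often-wrong call per person.
hard_labels = ["woman" if p >= 0.5 else "man" for _, p in predictions]

# Aggregate estimate: the expected share of women. Individual errors tend to cancel
# out, provided the probabilities are roughly calibrated for this population.
expected_share_women = sum(p for _, p in predictions) / len(predictions)

print(hard_labels)                                              # ['woman', 'woman', 'woman', 'man']
print(f"Estimated share of women: {expected_share_women:.0%}")  # 54%
```

The aggregate figure degrades more gracefully when individual guesses are uncertain, which is why the authors recommend it over person-level labels wherever the research question allows.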
Although the study sounds the alarm about the use of ascription tools, Lockhart stresses that the intention isn't to erode trust in academic research.
“I don't want to undermine faith in research on discrimination and inequality,” Lockhart says. “Our goal is to suggest we use tools better. Use it in a way that is more accurate, more ethical, and gets you to ask more interesting research questions. Once you realize that the thing you're studying is not some boring transcendental, categorical A or B object, but a messy cultural, uneven process, you get a much more interesting and rich object of study.”
Part of the reason the team created the survey in the first place was to look at disparities around disability, name changes, sexual orientation, and parents' education: things that can't be studied with these tools. Names aren't a perfect measure of anything, but they are a measure of the cultural process of meaning-making and proving identity.
“If these tools are limited in any way, then maybe people ought to be doing surveys and other kinds of data collection that allow them to study a broader range of inequality,” Lockhart says. “We know inequality exists. If we want to do something interesting, we have to go beyond that and study the process of inequality. It's sort of a call for more and better research.”