For decades, psychometric evaluation has rested on the assumption that cognitive ability can be distilled into a series of binary outcomes: a response is either correct or incorrect. A study recently published in the Journal of Political Economy suggests that this fundamental methodology may be deeply flawed, inadvertently penalizing individuals, disproportionately women, who possess a nuanced understanding of their own uncertainty. Researchers Glenn W. Harrison and J. Todd Swarthout of Georgia State University, together with Don Ross of University College Cork, demonstrate that when the rigid constraints of traditional multiple-choice testing are relaxed to allow the expression of subjective confidence, the long-standing "gender gap" in intelligence and financial literacy not only narrows but, in some cases, reverses entirely.
The study challenges the prevailing "forced-choice" paradigm that dominates everything from elementary school standardized tests to the SAT, GRE, and professional licensure exams. By requiring test-takers to commit to a single answer, these assessments fail to distinguish between a person who is certain of the correct answer and one who is merely guessing between two likely options. This lack of nuance, the researchers argue, overlooks a critical component of human cognition: the ability to accurately calibrate one’s own beliefs against objective reality.
The Architecture of Cognitive Measurement
To investigate the impact of test structure on performance, the research team focused on the Raven Advanced Progressive Matrices (RAPM). Since its development in the 1930s, the Raven’s test has been regarded as a gold standard for measuring "fluid intelligence"—the capacity to think logically and solve problems in novel situations, independent of acquired knowledge or cultural background. The test typically presents a 3×3 grid of geometric patterns with one cell left blank; the participant must identify the missing piece from eight possible options.
In a traditional setting, the RAPM is scored by counting correct answers within a set timeframe. This format, the researchers note, treats all incorrect answers as equal failures, whether the test-taker was 90% sure of their choice or merely 12.5% sure (the accuracy of a random guess among eight options). To correct for this, the team designed a computerized experiment that introduced two major variables: financial incentives and a "token" allocation system.
The study involved participants divided into three distinct experimental conditions. The first, a baseline group, completed a traditional version of the Raven’s test for a flat participation fee of five dollars. The second group was paid based on their accuracy but remained restricted to the traditional "one-answer" format. The third group utilized a "belief elicitation" framework. These participants were given 80 digital tokens per puzzle and were permitted to distribute them across the eight possible answers based on their level of confidence.
If a participant was entirely certain of an answer, they could place all 80 tokens on a single choice, yielding a maximum payout of two dollars if correct. Conversely, if they were torn between two options, they could place 40 tokens on each, securing a smaller but guaranteed reward if either was correct. This mechanism transformed the test from a simple logic puzzle into a sophisticated exercise in risk management and subjective probability.
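The arithmetic of this mechanism can be sketched in a few lines. Assuming a linear payout of 2.5 cents per token placed on the correct answer (consistent with the $2.00 maximum for an all-in bet of 80 tokens, though the study's exact payment schedule may differ), splitting tokens between two equally likely options leaves expected earnings unchanged but eliminates the variance of an all-or-nothing bet:

```python
# Sketch of the token mechanism described above: 80 tokens per puzzle,
# a $2.00 maximum, so an assumed linear rate of $0.025 per token placed
# on the correct answer. Exactly one of the eight options is correct.

TOKENS = 80
PAYOUT_PER_TOKEN = 2.00 / TOKENS  # $0.025

def expected_payout(beliefs, allocation):
    """Expected dollar payout given subjective probabilities over the
    eight options and a token allocation summing to 80."""
    assert len(beliefs) == len(allocation) and sum(allocation) == TOKENS
    return sum(p * t * PAYOUT_PER_TOKEN for p, t in zip(beliefs, allocation))

def payout_variance(beliefs, allocation):
    """Variance of the payout, since only one option pays out."""
    mean = expected_payout(beliefs, allocation)
    return sum(p * (t * PAYOUT_PER_TOKEN - mean) ** 2
               for p, t in zip(beliefs, allocation))

# A test-taker torn 50/50 between options A and B, others ruled out:
beliefs = [0.5, 0.5, 0, 0, 0, 0, 0, 0]
all_in  = [80, 0, 0, 0, 0, 0, 0, 0]   # bet everything on A
hedged  = [40, 40, 0, 0, 0, 0, 0, 0]  # split between A and B

print(expected_payout(beliefs, all_in))   # 1.0
print(expected_payout(beliefs, hedged))   # 1.0 (same expected value)
print(payout_variance(beliefs, all_in))   # 1.0 (all-or-nothing risk)
print(payout_variance(beliefs, hedged))   # 0.0 (guaranteed $1.00)
```

Under this illustrative rule both allocations are worth $1.00 in expectation, but the hedged allocation guarantees it, which is why the token format rewards an accurate sense of one's own uncertainty rather than boldness.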
Reversing the Gender Narrative
The most striking findings emerged when the researchers analyzed the performance of the third group. Under the traditional "forced-choice" format, male participants generally scored higher than female participants, reinforcing historical (though often contested) data regarding gender differences in spatial and fluid reasoning. However, when the ability to express uncertainty was introduced alongside financial incentives, the hierarchy shifted.
Women outperformed men when they were allowed to hedge their bets using the token system. The data revealed that female participants were significantly more adept at identifying when they did not know the answer and distributing their tokens efficiently to mitigate risk. In contrast, men tended to exhibit "overprecision"—a form of overconfidence where they placed too many tokens on a single, often incorrect, answer.
This suggests that the traditional "gender gap" in intelligence testing may not be a reflection of a gap in raw cognitive ability, but rather a gap in how different genders navigate the "all-or-nothing" risk inherent in traditional testing. When the test environment rewarded accurate risk assessment—a trait the researchers consider a fundamental element of intelligence—women proved to be the more effective cognitive agents.
From Logic to Literacy: Extending the Findings
Building on the results of the Raven’s test, the researchers expanded their investigation into two other areas where gender disparities are frequently cited: workplace competitiveness and financial literacy.
In the realm of competitiveness, behavioral economics has long suggested that women are more risk-averse and less willing to enter competitive environments such as "winner-take-all" tournaments. The researchers recreated these scenarios using their token system. They found that men frequently opted into competitive structures even when the mathematical probability of winning was low, leading to overall financial losses. Women, by contrast, accurately evaluated the risks and chose more stable compensation structures. What had previously been labeled a "lack of confidence" or a "lack of competitiveness" in women was revealed to be superior mathematical risk management.
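The competitiveness result reduces to a comparison of expected values. With hypothetical stakes (the study's actual amounts are not reproduced here), choosing a winner-take-all tournament over a flat piece rate is a losing bet whenever the chance of winning is too low:

```python
def tournament_ev(p_win, prize):
    """Expected payment from a winner-take-all tournament."""
    return p_win * prize

def piece_rate_ev(expected_correct, rate_per_answer):
    """Expected payment from a flat per-answer piece rate."""
    return expected_correct * rate_per_answer

# Hypothetical stakes: a $4.00 winner-take-all prize versus $0.50 per
# correct answer, for someone expecting 5 correct answers but holding
# only a 30% chance of winning the tournament.
print(tournament_ev(0.30, 4.00))   # 1.2
print(piece_rate_ev(5, 0.50))      # 2.5  -> the piece rate dominates
```

Entering the tournament here sacrifices $1.30 in expectation. The study's finding is that women tended to make this comparison correctly, while men frequently entered anyway.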
The team then applied this lens to financial literacy. Standard surveys often include a "do not know" option for complex questions about inflation, interest rates, and diversification. Historically, women select "do not know" at higher rates than men, leading to the conclusion that women are less financially literate.
However, when the researchers allowed participants to use the token system for these questions, the "literacy gap" largely evaporated. The results showed that women and men possessed similar levels of underlying knowledge. The difference was that women were more honest and aware of their own uncertainty. While a man might guess a single answer and happen to be right (or wrong), a woman was more likely to acknowledge that she was only 60% sure, distributing her tokens accordingly.
Implications for Education and the Workforce
The implications of this research are far-reaching, suggesting that current methods for vetting students and employees may be systematically selecting for overconfidence rather than actual competence.
"The measurement of intelligence should identify and measure an individual’s subjective confidence that a response to a test question is correct," the authors wrote. By failing to do so, institutions may be overlooking candidates who possess "intellectual humility"—the ability to recognize the limits of their own knowledge.
In a professional context, the researchers pointed out that the "overconfident" profile (typically associated with higher scores in traditional formats) is actually the most dangerous in high-stakes environments like finance or medicine. An individual who is 100% confident in a wrong answer is far more likely to cause a catastrophic failure than someone who knows they are guessing and seeks a second opinion or additional data.
Environmental Clues and Cognitive Scaffolding
The study also delved into how the structure of a test provides "environmental clues" that aid performance. The Raven’s test is traditionally organized in a "structured progression," starting with very easy puzzles and gradually increasing in difficulty. This sequence acts as a form of "scaffolding," helping the test-taker learn the logic of the test as they go.
The researchers experimented by scrambling the order of the puzzles, presenting easy and difficult tasks randomly. As expected, overall performance dropped. However, the performance gap between the "forced-choice" group and the "token" group widened even further in the scrambled version. This indicates that the ability to express uncertainty is an even more significant cognitive advantage when facing unpredictable or disorganized problems—the very types of problems most common in the real world.
Future Research and Broader Demographics
While the gender findings have garnered the most immediate attention, the researchers noted that their data suggests similar patterns in other demographic groups. Initial findings indicated that Black participants also showed a marked improvement in performance when allowed to use the token system compared to traditional formats. This raises the possibility that many "achievement gaps" currently attributed to socioeconomic or cultural factors may, in part, be artifacts of the testing format itself.
The authors cautioned that further research is needed to isolate personal motivations from financial ones. Some participants may bring "intrinsic" motivations—such as a desire to "beat the test" or a fear of appearing uncertain—that interact with the monetary rewards in ways that are difficult to quantify in a laboratory setting.
The study, "Gender, Confidence, and the Mismeasure of Intelligence, Competitiveness, and Literacy," serves as a call to action for the psychometric community. If the goal of testing is to identify the individuals best equipped to navigate a complex and uncertain world, then the tests themselves must evolve to reward the precision of belief rather than the boldness of a guess. As Harrison, Ross, and Swarthout conclude, knowing what you do not know is not a sign of weakness; it is a hallmark of intelligence.