S1-205 - The Dunning-Kruger effect

What the famous finding actually claims, why much of it may be a statistical mirage, and what survives

In 1995, a moderately built man named McArthur Wheeler robbed two Pittsburgh banks in the middle of the day, without a disguise, apparently unconcerned that the security cameras would record his face. When police arrested him that evening and showed him the surveillance footage, he stared at it in bewilderment. “But I wore the juice,” he said. Wheeler had been told, and apparently genuinely believed, that rubbing lemon juice on one’s face would render it invisible to cameras, on the reasoning that lemon juice is used as an invisible ink. He had even tested the idea by taking a Polaroid photograph of himself, which had not come out. He concluded that his face was undetectable, rather than that his camera was broken.

The story reached David Dunning, a psychologist at Cornell University, who found it less funny and more interesting than his colleagues did. The puzzle was not that Wheeler had a false belief. People hold false beliefs routinely, and usually revise them when confronted with contrary evidence. The puzzle was the quality of Wheeler’s confidence: the complete absence of any uncertainty, any sense that the plan might not work, any awareness that there was something he did not understand. He had not merely been wrong. He had been wrong in a way that made him incapable of recognizing that he was wrong, because the very knowledge that would have revealed the error was the knowledge he lacked.

Dunning, working with his graduate student Justin Kruger, designed a series of experiments to test whether this pattern (the coincidence of incompetence and overconfidence) was systematic rather than anecdotal. Their 1999 paper, published in the Journal of Personality and Social Psychology, reported results that have become among the most cited in social psychology.¹ The finding seemed elegant in its structure and uncomfortable in its implications. But there is a problem with the famous story, and the problem is worth understanding in detail because it is a small masterclass in how statistics can manufacture the appearance of psychology.

What the effect claimed

In their original study, Dunning and Kruger tested participants on three domains: logical reasoning, English grammar, and the ability to recognize what is funny. In each domain, they measured actual performance and then asked participants to estimate their own performance relative to others. The headline finding had two parts. At the bottom, people who scored worst dramatically overestimated themselves: those in the lowest quarter typically guessed they had beaten about 60 percent of their peers. At the top, the best scorers slightly underestimated their relative standing, guessing around 70 percent when their actual performance placed them higher.

The interpretation offered was a “dual burden”: the skills needed to do well at something are often the same skills needed to judge how well one has done, so the least competent are doubly cursed (bad at the task and bad at knowing it), while the most competent, comfortable with the material, assume others find it as easy as they do and so underrate their relative edge.

It is a compelling story, and it spread far beyond the original modest claim. The viral version (“the incompetent are confidently deluded, the experts are humble”) is the one most people carry. That version is largely wrong, and the reasons are worth understanding because they illuminate something important about the relationship between data and interpretation.

The first problem: regression to the mean

Consider a thought experiment that strips out psychology entirely. Suppose self-assessment had no relationship at all to actual skill: everyone simply guessed their own rank more or less at random, clustered loosely around the middle. Now sort people by their real test score and look at each group’s average guess.

The people who actually scored at the very bottom guessed more or less randomly, which means their guesses sit, on average, somewhere near the middle, far above where they actually scored. They appear to massively overestimate themselves. The people who actually scored at the very top also guessed near the middle on average, which is now far below where they actually scored. They appear to underestimate themselves. Both halves of the Dunning-Kruger pattern have just emerged from data containing no psychology whatsoever.

This is regression to the mean. Whenever two measures are only imperfectly related, people who sit at an extreme on one will sit, on average, closer to the middle on the other. The bounded scale reinforces the pattern, because the extremes can only err toward the center: if one is genuinely at the bottom, the only direction the estimate can be wrong is upward; if one is at the top, the only direction is downward. Any noisy, imperfect self-estimate will therefore produce overestimation at the bottom and underestimation at the top automatically, whether or not anything interesting is happening in anyone’s head.
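The thought experiment is easy to run. The following is a minimal sketch, not a reconstruction of any published analysis: the function name, the sample size, and the choice of guesses drawn from a normal distribution centered on 50 are all assumptions made for illustration. Guesses are generated with no reference to skill, then averaged within quartiles of actual standing.

```python
import random
import statistics


def quartile_summary(n=10_000, seed=0):
    """Return (mean actual percentile, mean guessed percentile) per quartile."""
    rng = random.Random(seed)
    # Actual standing: a true percentile between 0 and 100 for each person.
    actual = [rng.uniform(0, 100) for _ in range(n)]
    # Guessed standing: drawn with no reference to skill at all, clustered
    # loosely around the middle and clipped to the 0-100 scale (an assumption).
    guess = [min(100.0, max(0.0, rng.gauss(50, 15))) for _ in range(n)]

    people = sorted(zip(actual, guess))  # sort by actual standing
    size = n // 4
    summary = []
    for q in range(4):
        group = people[q * size:(q + 1) * size]
        mean_actual = statistics.mean(a for a, _ in group)
        mean_guess = statistics.mean(g for _, g in group)
        summary.append((mean_actual, mean_guess))
    return summary


if __name__ == "__main__":
    for q, (a, g) in enumerate(quartile_summary(), start=1):
        print(f"quartile {q}: actual {a:5.1f}, guessed {g:5.1f}, gap {g - a:+6.1f}")
```

Every quartile's average guess lands near 50, so the bottom quartile shows a large positive gap and the top quartile a large negative one, even though no simulated person knows anything about their own skill.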

The second problem: autocorrelation

The sharper criticism, developed most fully by Edward Nuhfer and colleagues in 2016 and 2017, concerns how the famous chart is drawn. The Dunning-Kruger graph plots self-assessment error on one axis against actual score on the other. But self-assessment error is calculated as the guess minus the actual score. That score is therefore baked into both axes: the thing on the vertical axis contains the thing on the horizontal axis by construction.

When a quantity is plotted against something that is mathematically part of it, a slope appears automatically, even from pure random numbers. Critics demonstrated exactly this: feed randomly generated data into the Dunning-Kruger procedure and the signature chart appears every single time. A pattern that shows up reliably in random noise cannot, on its own, be evidence of a psychological effect. It is evidence about the arithmetic of the chart. This is the argument summarized in the phrase “the Dunning-Kruger effect is autocorrelation.”
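The arithmetic can be checked directly. This sketch is an illustration only: the helper names and the use of uniform random numbers are assumptions, and it is not the procedure of any of the cited papers. It generates two unrelated columns of numbers, labels one "actual" and the other "guess", and compares the correlation between the raw columns with the correlation between the actual score and the error (guess minus actual).

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_from_noise(n=10_000, seed=1):
    """Return (corr(actual, guess), corr(actual, guess - actual)) for pure noise."""
    rng = random.Random(seed)
    actual = [rng.uniform(0, 100) for _ in range(n)]
    guess = [rng.uniform(0, 100) for _ in range(n)]   # independent of actual
    error = [g - a for g, a in zip(guess, actual)]    # contains actual by construction
    return pearson(actual, guess), pearson(actual, error)


if __name__ == "__main__":
    r_raw, r_error = correlations_from_noise()
    print(f"actual vs guess: {r_raw:+.2f}")
    print(f"actual vs error: {r_error:+.2f}")
```

The raw columns are essentially uncorrelated, while the error column correlates strongly and negatively with the actual score. For two independent variables with equal variance, the expected correlation between x and (y - x) is -1/√2, roughly -0.71, which is the slope the chart inherits before any psychology enters.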

Why the effect is larger at the bottom

A natural objection arises here, and it is the most clarifying question in the whole topic. If the mechanism were only the boundary (that the bottom can err only upward and the top only downward), then the effect should be symmetric: equal and opposite distortions at the two ends. So why is the overestimation at the bottom so much larger and more dramatic than the underestimation at the top? Why is the effect lopsided?

The boundary alone does not explain this. Something has to break the symmetry, and the main culprit is a second, independent and well-established bias: the better-than-average effect, the robust tendency of most people, across most desirable traits, to rate themselves a little above average. This is the same effect behind the finding that the large majority of drivers rate themselves as better-than-average drivers.³ Crucially, this bias is not symmetric around the middle; it is a general upward shove on everyone’s self-estimate. Consider what that upward shove does at each end when it is added to regression:

At the bottom, regression pulls the low scorers’ estimates upward toward the middle, and the better-than-average bias also pushes them upward. The two effects point the same way and stack. The result is a large overestimate.

At the top, regression pulls the high scorers’ estimates downward toward the middle, but the better-than-average bias pushes them upward. The two effects point in opposite directions and partly cancel. The result is only a small apparent underestimate.

So the asymmetry is not produced by the ceiling and floor, which are symmetric. It is produced by a uniform upward bias that compounds the regression error at the bottom and offsets it at the top. The bottom group gets hit by both barrels in the same direction; the top group gets the two barrels fighting each other. That alone reproduces the lopsided Dunning-Kruger shape, with no psychology beyond the single, uncontested fact that people tend to rate themselves slightly high.
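The stacking argument can be made concrete with a toy model. Everything in this sketch is an assumption chosen for illustration: the guess follows true standing only weakly (the `tracking` parameter), a constant `bias` stands in for the better-than-average effect, and the specific values are not estimates from any study.

```python
import random
import statistics


def clip(value, low=0.0, high=100.0):
    return max(low, min(high, value))


def quartile_gaps(bias, n=20_000, seed=2, tracking=0.3, noise_sd=10.0):
    """Mean (guess - actual) for the bottom and top quartiles of actual standing."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        actual = rng.uniform(0, 100)
        # The guess tracks true standing only weakly (regression toward 50),
        # plus a uniform upward shove (bias), plus noise.
        guess = clip(50 + tracking * (actual - 50) + bias + rng.gauss(0, noise_sd))
        people.append((actual, guess))
    people.sort()
    size = n // 4
    bottom = statistics.mean(g - a for a, g in people[:size])
    top = statistics.mean(g - a for a, g in people[-size:])
    return bottom, top


if __name__ == "__main__":
    for bias in (0, 15):
        bottom, top = quartile_gaps(bias)
        print(f"bias {bias:2d}: bottom gap {bottom:+6.1f}, top gap {top:+6.1f}")
```

With the bias set to zero, the two gaps are roughly equal and opposite. Adding the same upward shove to everyone enlarges the bottom gap and shrinks the top one, which is the lopsided shape described above.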

Do experts really underestimate themselves?

This is the question that makes the whole dispute concrete, and the answer is mostly no. The apparent humility of high performers is the easiest part of the effect to explain away, for a simple reason: when one is genuinely near the top, there is almost no room to overestimate. No one can claim to be in the 130th percentile. Someone truly at the 95th percentile who modestly guesses “the 80th, maybe” looks like they are underestimating themselves, but the appearance is largely forced by the boundary, not by any special self-awareness or modesty. There is simply far more room to be wrong in the downward direction when one is already high up.
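The ceiling can be isolated in a few lines. In this sketch (the function name, the noise level, and the normal-noise model are illustrative assumptions), a person's guess is their true percentile plus symmetric, unbiased noise, clipped to the 0-100 scale. The only thing that can move the average guess away from the truth is the boundary.

```python
import random
import statistics


def mean_clipped_guess(true_percentile, noise_sd=15.0, n=50_000, seed=3):
    """Average guess when unbiased noise is clipped to the 0-100 scale."""
    rng = random.Random(seed)
    guesses = (
        max(0.0, min(100.0, true_percentile + rng.gauss(0, noise_sd)))
        for _ in range(n)
    )
    return statistics.mean(guesses)


if __name__ == "__main__":
    for true_percentile in (5, 50, 95):
        print(f"true {true_percentile:2d} -> mean guess {mean_clipped_guess(true_percentile):5.1f}")
```

Someone truly at the 95th percentile ends up with an average guess somewhat below 95, someone at the 5th with an average somewhat above 5, and someone at the 50th is essentially unaffected. No modesty is modeled anywhere in the code.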

When researchers analyze the data with methods designed to remove the regression and boundary artifacts (by measuring ability twice, or using more rigorous nonlinear techniques), the dramatic “experts underrate themselves” effect mostly shrinks or disappears. So the strong claim, that expertise produces systematic self-underestimation as a psychological fact, does not robustly hold. It is mostly an artifact of where the top sits on the scale.
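The logic of measuring ability twice can be sketched in simulation. This is an illustration of the general idea under stated assumptions, not the procedure of any particular paper: ability is a latent value in z-score units, each test is that value plus independent noise, and the self-estimate is deliberately modeled as unbiased, so any apparent overestimation is an artifact of the comparison.

```python
import random
import statistics


def bottom_quartile_gaps(n=20_000, seed=4, test_noise=0.7, guess_noise=0.7):
    """Apparent overestimation in the bottom quartile, judged two ways."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        skill = rng.gauss(0, 1)                     # latent ability (z-score units)
        test1 = skill + rng.gauss(0, test_noise)    # first noisy measurement
        test2 = skill + rng.gauss(0, test_noise)    # second, independent measurement
        guess = skill + rng.gauss(0, guess_noise)   # unbiased but noisy self-estimate
        rows.append((test1, test2, guess))
    rows.sort()                                     # rank people on test 1 only
    bottom = rows[: n // 4]
    # Compare the guess with the same test used for ranking (contaminated) ...
    gap_same = statistics.mean(g - t1 for t1, _, g in bottom)
    # ... and with the fresh test that played no part in the ranking.
    gap_fresh = statistics.mean(g - t2 for _, t2, g in bottom)
    return gap_same, gap_fresh


if __name__ == "__main__":
    gap_same, gap_fresh = bottom_quartile_gaps()
    print(f"gap against ranking test: {gap_same:+.2f}")
    print(f"gap against fresh test:   {gap_fresh:+.2f}")
```

Judged against the test that was used to sort them, the bottom group appears to overestimate itself substantially. Judged against the second test, the gap is close to zero, because the bad luck that helped place people in the bottom group does not carry over to an independent measurement.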

What survives

If the dramatic version is mostly artifact, is there anything real left? Probably a weaker, gentler effect, though even this is debated. What tends to survive the more careful analyses is not a story about confident fools and humble experts, but something more modest: self-assessment accuracy improves with skill. Better performers tend to judge their own standing more accurately (their estimates track reality more closely), while poorer performers’ estimates are noisier, less tightly connected to how they actually did. This is the genuine residue of Dunning and Kruger’s “dual burden” idea: the metacognitive skill needed to judge one’s performance in a domain and the skill needed to perform in it do seem to grow together, so people weak in a domain are also somewhat worse at gauging their weakness.
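The surviving claim is about the spread of errors, not their average direction, and the two are easy to confuse. The sketch below builds in the assumption that self-estimate noise shrinks as skill grows (the linear form and the numbers are hypothetical, and the scale is left unbounded to keep boundary effects out), then reports the mean signed error and the spread of error separately for each skill quartile.

```python
import random
import statistics


def error_by_quartile(n=20_000, seed=5):
    """Return (mean signed error, spread of error) for each skill quartile."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        skill = rng.uniform(0, 1)
        # Assumption for illustration: self-estimate noise shrinks as skill grows.
        noise_sd = 0.30 - 0.20 * skill
        estimate = skill + rng.gauss(0, noise_sd)   # no bias, only noise
        people.append((skill, estimate - skill))
    people.sort()
    size = n // 4
    result = []
    for q in range(4):
        errors = [e for _, e in people[q * size:(q + 1) * size]]
        result.append((statistics.mean(errors), statistics.stdev(errors)))
    return result


if __name__ == "__main__":
    for q, (bias, spread) in enumerate(error_by_quartile(), start=1):
        print(f"quartile {q}: mean error {bias:+.3f}, spread {spread:.3f}")
```

In this toy population the mean error is near zero in every quartile, so nobody is systematically deluded, yet the spread is largest at the bottom and smallest at the top. That is the shape of the weaker claim: less precise self-knowledge among poorer performers, without confident overestimation.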

Note how much smaller this claim is than the famous one. “Poor performers have noisier self-assessment” is real, plausible, and survives scrutiny. “The incompetent are confidently deluded while experts are humble” is the part that turns out to be largely manufactured by regression to the mean, autocorrelation, and the better-than-average effect.

How settled is this?

It is contested, not closed, and it is worth being precise about the state of the argument rather than swapping one slogan (“the incompetent don’t know it”) for another (“Dunning-Kruger is debunked”).

What is essentially agreed: the classic quartile chart is contaminated by regression to the mean and by autocorrelation, and the dramatic story of deluded novices and humble experts is, to a large degree, a statistical artifact. The critiques on this point (Krueger and Mueller in 2002, Nuhfer and colleagues in 2016 and 2017, Gignac and Zajenkowski in 2020) are technically strong and not seriously disputed as mathematics.²

What remains in dispute: whether anything psychological survives once the artifacts are removed. Some analysts argue essentially nothing does: the whole pattern reduces to regression plus the better-than-average effect. Others argue a real but modest residue persists, that poor performers genuinely have less accurate self-knowledge, and a number of studies using stricter methods still report partial effects in some domains. A 2024 study of creativity captured the situation neatly: the effect appeared under the classic quartile method and failed to appear under more rigorous methods. That is the fingerprint of a result that is mostly, but perhaps not entirely, an artifact.

So the honest verdict: the Dunning-Kruger effect as popularly understood, and as originally charted, is substantially contested and largely explicable as a statistical artifact; a weaker, real phenomenon (self-assessment accuracy rising with skill) may survive, but it is far less dramatic than the version that became famous, and the question is not settled.

The mountain metaphor, and why it remains useful

There is an image that captures the mechanism Dunning and Kruger proposed, and it remains pedagogically useful even if the quantitative claims are contested. Imagine a mountain range, seen from below. Standing at the base, one can see the foothills clearly, and beyond them, seemingly close, what appears to be the summit. There is no sense of the scale of what lies between the base and the top, because it cannot be seen from that vantage point. The difficulty of the climb is therefore estimated from what is visible, which produces a systematic underestimate.

Now imagine the same mountain range, seen from a position partway up. Suddenly the foothills look much further below than expected. The summit looks much further above. The terrain between that position and the top is more visible, and it is more difficult than it appeared from the base. The estimate of the total difficulty of the climb, and of how far one has come, has been dramatically revised, not because the mountain has changed, but because the new vantage point gives access to information that was previously invisible.

This is a geometric rather than a psychological description of what Dunning and Kruger were trying to capture. The novice’s overconfidence, to the extent it exists beyond the statistical artifacts, is not primarily a failure of character; it is a failure of vantage point. From the base of a domain of knowledge, the structure of the domain is invisible. The divisions, sub-fields, methodological debates, unresolved questions, and historical complexity that constitute actual expertise are not visible because the vantage point does not reach them. The novice constructs an estimate of the domain’s difficulty from what they can see, which is the entry-level material that superficially resembles a summary of the whole.⁴

The metaphor survives the statistical critique because it describes something real about the phenomenology of learning, even if the original quantitative claims were inflated by artifacts. Whether or not the Dunning-Kruger chart measures what it purports to measure, the experience of discovering that a domain is vastly more complex than it appeared from the outside is familiar to anyone who has seriously entered a field of study. The mountain was always there. What changed was the capacity to perceive it.

The valley of despair

There is a further stage in the progression that the original Dunning-Kruger paper did not fully emphasize but that subsequent research and the broader learning literature have made clear. As knowledge in a domain deepens (as the novice begins to see more of the mountain), confidence typically falls before it rises again. The early learner who confidently asserts that economics is simple, or that a legal question has an obvious answer, or that a programming problem is straightforward, typically passes through a period of uncomfortable uncertainty as the domain’s actual complexity becomes visible.

This is sometimes called the valley of despair, and it is a predictable feature of the learning curve rather than a sign that learning has gone wrong. It is the moment when the novice’s early map has grown just accurate enough to reveal the terrain it was missing, when the provisional model that made the domain seem manageable has given way to a fuller view that makes it seem overwhelming. The expert who has passed through the valley and emerged with genuine competence typically carries a residue of that encounter: a calibrated uncertainty that the novice’s early confidence lacks, not because the expert knows less but because they know more about what there is to know.

Article 212 of this series describes several practical tools for gaining a more accurate sense of the terrain above one’s current position, for improving one’s vantage point without requiring the climb itself. But what the current article contributes is the prior observation: whether or not the quantitative Dunning-Kruger claims survive statistical scrutiny, the phenomenology of encountering complexity (of watching one’s confidence drop as one’s knowledge grows) is real and recognizable to anyone who has seriously pursued expertise in any domain.

The irony worth keeping

There is a final point that is more than a curiosity. The reason the Dunning-Kruger effect fooled people for nearly two decades, including capable researchers, is precisely that a piece of statistical structure (regression to the mean, a variable plotted against itself) produced something that looked like a deep fact about human psychology. People saw a distribution and read it as a personality. The effect became a household name on the strength of an artifact.

That is a sharper lesson than the one the effect is usually wheeled out to teach. The popular use of Dunning-Kruger is to point at someone else’s overconfidence. The better lesson is about how easily anyone, expert included, can mistake the shape of their data for a discovery about people, which is a humility the confident citers of the effect rarely extend to themselves.

The Conscious Look, applied to calibration

The diagnostic question of this series (what would have to be true for this model to be wrong?) has a specific and important application here. If even skilled researchers can be misled by statistical artifacts into believing they have discovered a psychological phenomenon, then the rest of us are not immune. The question is not whether we are susceptible to miscalibration (the evidence suggests we all are, in various ways) but whether we can develop practices that reduce the distortion.

The first practice is the Feynman test described in article 212: attempt to explain the subject without using its technical vocabulary, in terms that a curious and intelligent non-specialist could follow. The gaps in the explanation reveal the gaps in the understanding, in a way that is at least partially accessible to the person performing the test. If the explanation collapses into jargon when pressed (if the technical vocabulary is doing the work that genuine understanding should be doing), then the gap is visible, even if it is uncomfortable to acknowledge.

The second is to treat the valley of despair as a navigational signal rather than a motivational obstacle. When increasing engagement with a domain produces increasing uncertainty rather than increasing confidence, the uncertainty is probably informative. It is the vantage point shifting upward on the mountain, and the loss of confidence it produces is evidence of genuine progress rather than evidence that the domain is not worth pursuing. The novice who expects to feel more confident as they learn more will be surprised and discouraged by the valley. The learner who expects the valley (who knows that the encounter with genuine complexity is the first sign that the map is becoming more accurate) will receive the discomfort as information.

The Dunning-Kruger effect, in its original formulation, claimed that the model of one’s own competence is systematically distorted in predictable ways. The statistical critique reveals that much of the evidence for that claim was itself a distortion, a case of researchers’ models of their data being systematically wrong. The irony is perfect, and it is the kind of irony this series exists to notice: the study of overconfidence was itself overconfident. The map of miscalibration was itself miscalibrated.

What survives this recursion is not nothing. The weaker claim (that self-assessment accuracy improves with skill, that novices are noisier in their self-estimates than experts) is probably real. The phenomenology of the valley of despair is real. The mountain metaphor captures something true about the experience of learning. And the lesson about statistical artifacts is itself valuable: if a finding this famous, this replicated, this intuitively satisfying can turn out to be largely an artifact of how the chart was drawn, then the bar for confidence in any psychological claim should be higher than most of us set it.

The Conscious Look, applied to self-assessment, is therefore the practice of treating one’s own confidence as data rather than as ground truth: asking what evidence underlies it, what evaluative capacity generated it, and what would be required to distinguish genuine competence from the appearance of competence that regression to the mean, the better-than-average effect, and the limits of one’s own vantage point can so easily produce.

Notes

¹ The McArthur Wheeler story is reported in Kruger, J., and Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121-1134. Dunning has described Wheeler’s case as the original inspiration for the research program. The specific claim about lemon juice as invisible ink is accurate: lemon juice does function as a heat-sensitive invisible ink, which apparently led Wheeler to a dramatically incorrect inference about its properties as a facial concealer for cameras.

² The statistical-artifact arguments (regression to the mean, autocorrelation, better-than-average stacking) are well established and technically uncontested; what remains genuinely disputed is whether a real psychological residue survives. The key critical papers are: Krueger, J., and Mueller, R. A. (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of Personality and Social Psychology, 82(2), 180-188; Nuhfer, E., et al. (2016, 2017), two papers in Numeracy arguing the effect can be reproduced from random data; Gignac, G. E., and Zajenkowski, M. (2020). The Dunning-Kruger effect is (mostly) a statistical artefact: Valid approaches to testing the hypothesis with individual differences data. Intelligence, 80. Note also the later exchange (Hiller 2023; Gignac and Zajenkowski 2023) keeping the matter live.

³ The better-than-average effect is documented extensively. The classic demonstration is Svenson, O. (1981). Are we all less risky and more skillful than our fellow drivers? Acta Psychologica, 47, 143-148. For the mechanisms (self-enhancement plus egocentric/focalism accounts), see Alicke, M. D., and Govorun, O. (2005). The better-than-average effect. In The Self in Social Judgment; and, on the easy/hard reversal, Kruger, J. (1999). Lake Wobegon be gone! The “below-average effect” and the egocentric nature of comparative ability judgments. Journal of Personality and Social Psychology, 77(2), 221-232.

⁴ The relationship between partial familiarity and overconfidence is related to what the psychologists Leonid Rozenblit and Frank Keil called the illusion of explanatory depth: the finding that people tend to believe they understand how things work in far more detail than they actually do, and that this confidence drops once they are asked to produce a step-by-step explanation. This connects to the series’ treatment of the knowledge illusion in article 202.
