Simpson's paradox (or the Yule-Simpson effect) is a statistical paradox wherein the relative success rates of groups seem to reverse when the groups are combined. This result is often encountered in social and medical science statistics, and occurs when frequency data are hastily given a causal interpretation; the paradox disappears when causal relations are derived systematically, through formal analysis.
Though mostly unknown to laymen, Simpson's paradox is well known to statisticians and is described in several introductory statistics books. Many statisticians believe that the general public should be made aware of counterintuitive results such as Simpson's paradox, in particular as a caution against inferring causal relationships from the association between two variables.
Edward H. Simpson described the phenomenon in 1951, although Karl Pearson et al. (in 1899) and Udny Yule (in 1903) had mentioned similar effects earlier. The name Simpson's paradox was coined by Colin R. Blyth in 1972. Since Simpson did not discover this statistical paradox, some authors have instead used the impersonal names reversal paradox and amalgamation paradox in referring to what is now called Simpson's paradox and the Yule-Simpson effect.
In both 1995 and 1996, Justice had a higher batting average (in bold) than Jeter; however, when the two years are combined, Jeter shows a higher batting average than Justice. According to Ross, this phenomenon would be observed about once per year among the interesting baseball players. In this particular case, the paradox can still be observed if the year 1997 is also taken into account:
The first table shows the overall success rates and numbers of treatments for both treatments (where Treatment A includes all open procedures and Treatment B is percutaneous nephrolithotomy):
|Treatment A||Treatment B|
|78% (273/350)||83% (289/350)|
This seems to show treatment B is more effective. If we include data about kidney stone size, however, the same set of treatments reveals a different answer:
| ||Treatment A||Treatment B|
|Small Stones||Group 1||Group 2|
|Large Stones||Group 3||Group 4|
|Both||78% (273/350)||83% (289/350)|
The information about stone size has reversed our conclusion about the effectiveness of each treatment. Treatment A is now seen to be more effective in both cases. In this example, the lurking variable (or confounding variable) of stone size was not known to be important until its effects were included.
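The reversal can be checked with a few lines of arithmetic. The sketch below is illustrative only: the per-group counts are not given in the text above, so the figures used are an assumption (they are the counts commonly quoted for this study, and they do add up to the stated totals of 273/350 and 289/350). The names `counts`, `rate` and `pooled` are hypothetical helpers.

```python
from fractions import Fraction

# (successes, total) per treatment and stone size.
# ASSUMPTION: these per-group counts are not stated in the text; they are the
# commonly quoted figures and are consistent with the totals 273/350 and 289/350.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},    # Group 1, Group 3
    "B": {"small": (234, 270), "large": (55, 80)},    # Group 2, Group 4
}

def rate(successes, total):
    """Success rate as an exact fraction."""
    return Fraction(successes, total)

def pooled(groups):
    """Add up successes and totals over all stone sizes."""
    successes = sum(s for s, _ in groups.values())
    total = sum(n for _, n in groups.values())
    return successes, total

for treatment, groups in counts.items():
    for size, (s, n) in groups.items():
        print(f"Treatment {treatment}, {size} stones: {float(rate(s, n)):.0%} ({s}/{n})")
    s, n = pooled(groups)
    print(f"Treatment {treatment}, both: {float(rate(s, n)):.0%} ({s}/{n})")

# A is ahead within each stone size, yet B is ahead once the sizes are pooled.
a_wins_each_size = all(
    rate(*counts["A"][size]) > rate(*counts["B"][size]) for size in ("small", "large")
)
b_wins_pooled = rate(*pooled(counts["B"])) > rate(*pooled(counts["A"]))
print(a_wins_each_size, b_wins_pooled)
```

Under these assumed counts, most of Treatment A's cases are large stones and most of Treatment B's are small stones, so the pooled rates are dominated by different subgroups.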
Which treatment is considered better is determined by an inequality between two ratios (successes/total). The reversal of the inequality between the ratios, which creates Simpson's paradox, happens because two effects occur together:
However when examining the individual departments, it was found that no department was significantly biased against women; in fact, most departments had a small bias against men.
|Applicants||% admitted||Applicants||% admitted|
The explanation turned out to be that women tended to apply to competitive departments with low rates of admission even among qualified applicants (such as English), while men tended to apply to less-competitive departments with high rates of admission among qualified applicants (such as engineering). The conditions under which department-specific frequency data constitute a proper defense against charges of discrimination are formulated in Pearl (2000).
In July 2006, the United States Department of Education released a study documenting student performances in reading and math in different school settings. It reported that while the math and reading levels for students at grades 4 and 8 were uniformly higher in private/parochial schools than in public schools, repeating the comparisons on demographic subgroups showed much smaller differences, which were nearly equally divided in direction.
The low birth weight paradox is an apparently paradoxical observation relating to the birth weights and mortality of children born to tobacco smoking mothers. Traditionally, babies weighing less than a certain amount (which varies between countries) have been classified as having low birth weight. In a given population, low birth weight babies have a significantly higher mortality rate than others. However, it has been observed that low birth weight children born to smoking mothers have a lower mortality rate than the low birth weight children of non-smokers.
Suppose two people, Lisa and Bart, each edit Wikipedia articles for two weeks. In the first week, Lisa edits 100 articles and improves 60 percent of them, while Bart edits 10 articles and improves 90 percent of them. In the second week, Lisa edits 10 articles and improves just 10 percent of them, while Bart edits 100 articles and improves 30 percent of them.
| ||Week 1||Week 2||Total|
|Lisa||60/100||1/10||61/110|
|Bart||9/10||30/100||39/110|
In each week, Bart improved a higher percentage of the articles he edited than Lisa did. When the two weeks are combined, however, each editor's overall percentage is a weighted average of the weekly percentages, with the weights given by the number of articles edited in each week. Overall, Lisa has improved a much higher percentage than Bart, because most of her edits fall in the week in which both editors did well, while most of Bart's fall in the week in which both did poorly. Therefore, like other paradoxes, it only appears to be a paradox because of incorrect assumptions, incomplete or misleading information, or a misunderstanding of a particular concept.
| ||Week 1 percentage||Week 2 percentage||Overall percentage|
|Lisa||60%||10%||55.5%|
|Bart||90%||30%||35.5%|
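Here are some notations, with $S_L(i)$ and $S_B(i)$ denoting Lisa's and Bart's success rates in week $i$:
In the first week, $S_L(1) = 60\%$ (Lisa improved 60 of 100 articles) and $S_B(1) = 90\%$ (Bart improved 9 of 10 articles).
In the second week, $S_L(2) = 10\%$ (Lisa improved 1 of 10 articles) and $S_B(2) = 30\%$ (Bart improved 30 of 100 articles).
On both occasions Bart's edits were more successful than Lisa's. But if we combine the two sets, we see that Lisa and Bart both edited 110 articles, and Lisa improved 61 of them ($S_L = 61/110 \approx 55.5\%$) while Bart improved only 39 ($S_B = 39/110 \approx 35.5\%$).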
Bart is better for each set but worse overall.
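The figures in this example can be reproduced directly from the weekly counts stated in the text. The following is a minimal sketch; the names `edits`, `weekly_rates` and `overall_rate` are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

# (articles improved, articles edited) for week 1 and week 2, from the text.
edits = {
    "Lisa": [(60, 100), (1, 10)],
    "Bart": [(9, 10), (30, 100)],
}

def weekly_rates(records):
    """Success rate for each week, as exact fractions."""
    return [Fraction(improved, edited) for improved, edited in records]

def overall_rate(records):
    """Success rate over both weeks: total improved divided by total edited."""
    improved = sum(i for i, _ in records)
    edited = sum(e for _, e in records)
    return Fraction(improved, edited)

for name, records in edits.items():
    weekly = ", ".join(f"{float(r):.0%}" for r in weekly_rates(records))
    print(f"{name}: weekly {weekly}; overall {float(overall_rate(records)):.1%}")

# Bart is ahead in every week, yet Lisa is ahead over the two weeks combined.
bart_ahead_weekly = all(
    b > l for b, l in zip(weekly_rates(edits["Bart"]), weekly_rates(edits["Lisa"]))
)
lisa_ahead_overall = overall_rate(edits["Lisa"]) > overall_rate(edits["Bart"])
print(bart_ahead_weekly, lisa_ahead_overall)
```

Exact fractions are used so that the comparisons do not depend on floating-point rounding.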
The paradox stems from our healthy intuition that Bart could not possibly be a better editor on each set but worse overall. Pearl (2000) in fact proved that this cannot happen when "better editor" is taken in the counterfactual sense: "Were Bart to edit all items in a set he would do better than Lisa would, on those same items." Clearly, frequency data cannot support this sense of "better editor," because they do not tell us how Bart would perform on items edited by Lisa, and vice versa. In the back of our minds, though, we assume that the articles were assigned at random to Bart and Lisa, an assumption which (for large samples) would support the counterfactual interpretation of "better editor." However, under random assignment, the data given in this example are impossible, which accounts for our surprise when confronting the rate reversal.
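The arithmetical basis of the paradox is uncontroversial. If $S_B(1) > S_L(1)$ and $S_B(2) > S_L(2)$, where $S_L(i)$ and $S_B(i)$ are Lisa's and Bart's success rates in week $i$, we feel that the overall rate $S_B$ must be greater than $S_L$. However, if different weights are used to form the overall score for each person, then this feeling may be disappointed. Here the first test is weighted $\frac{100}{110}$ for Lisa and $\frac{10}{110}$ for Bart, while the weights are reversed on the second test:
$$S_L = \frac{100}{110} S_L(1) + \frac{10}{110} S_L(2) = \frac{60 + 1}{110} \approx 55.5\%$$
$$S_B = \frac{10}{110} S_B(1) + \frac{100}{110} S_B(2) = \frac{9 + 30}{110} \approx 35.5\%$$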
By more extreme reweighting Lisa's overall score can be pushed up towards 60% and Bart's down towards 30%.
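The effect of the weights can be explored numerically. The sketch below holds the weekly rates fixed and varies only the number of articles edited; the larger workloads (1,000 and 10,000 articles) are hypothetical values, not part of the example, and `weighted_overall` is an illustrative helper name.

```python
def weighted_overall(rates, counts):
    """Overall rate as a weighted average of weekly rates,
    with each week weighted by its share of the articles edited."""
    total = sum(counts)
    return sum(rate * count / total for rate, count in zip(rates, counts))

lisa_rates = (0.60, 0.10)   # weeks 1 and 2, from the text
bart_rates = (0.90, 0.30)

# Weights from the example: Lisa 100/110 and 10/110, Bart 10/110 and 100/110.
print(f"Lisa: {weighted_overall(lisa_rates, (100, 10)):.3f}")
print(f"Bart: {weighted_overall(bart_rates, (10, 100)):.3f}")

# Hypothetical, more lopsided workloads: Lisa's overall rate approaches her
# week-1 rate (60%) and Bart's approaches his week-2 rate (30%).
for heavy in (100, 1_000, 10_000):
    lisa = weighted_overall(lisa_rates, (heavy, 10))
    bart = weighted_overall(bart_rates, (10, heavy))
    print(f"heavy week = {heavy:>6}: Lisa {lisa:.4f}, Bart {bart:.4f}")
```

Because each overall rate is a weighted average, it always stays between that editor's two weekly rates: Lisa's can never exceed 60% and Bart's can never fall below 30%.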
Lisa is a better editor on average, as her overall success rate is higher. But it is possible to have told the story in a way which would make it appear obvious that Bart is more diligent.
Simpson's paradox is an extreme example of the importance of including data about possible confounding variables when attempting to infer causal relations. Precise criteria for selecting a set of "confounding variables" (i.e., variables that yield correct causal relationships if included in the analysis) are given in Pearl (2000) using causal graphs.
While Simpson's paradox often refers to the analysis of count tables, as in the examples above, it also occurs with continuous data: for example, if one fits separate regression lines through two sets of data, both lines may show a positive trend, while a regression line fitted through all the data together shows a negative trend, as shown in the picture above.
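A small numerical illustration of the continuous case follows. The two data sets are made up for this sketch (they are not taken from the figure), and `ols_slope` is a hypothetical helper that computes the ordinary least-squares slope of a simple regression line.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares regression line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Made-up data: within each group y rises with x, but the second group
# sits at larger x and much lower y than the first.
group1_x, group1_y = [1, 2, 3, 4], [11, 12, 13, 14]
group2_x, group2_y = [8, 9, 10, 11], [1, 2, 3, 4]

slope1 = ols_slope(group1_x, group1_y)
slope2 = ols_slope(group2_x, group2_y)
pooled_slope = ols_slope(group1_x + group2_x, group1_y + group2_y)

print(f"group 1 slope: {slope1:+.3f}")
print(f"group 2 slope: {slope2:+.3f}")
print(f"pooled slope:  {pooled_slope:+.3f}")
```

Both within-group slopes are positive, while the pooled slope is negative because the difference between the groups dominates the trend within them.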
Simpson's paradox can also be illustrated using the 2-dimensional vector space. A success rate of $p/q$ can be represented by a vector $\vec{A} = (q, p)$, with a slope of $p/q$. If two rates $p_1/q_1$ and $p_2/q_2$ are combined, as in the examples given above, the result can be represented by the sum of the vectors $(q_1, p_1)$ and $(q_2, p_2)$, which according to the parallelogram rule is the vector $(q_1 + q_2, p_1 + p_2)$, with slope $\frac{p_1 + p_2}{q_1 + q_2}$.
Simpson's paradox says that even if a vector $\vec{b_1}$ (in blue in the figure) has a smaller slope than another vector $\vec{r_1}$ (in red), and $\vec{b_2}$ has a smaller slope than $\vec{r_2}$, the sum of the two vectors $\vec{b_1} + \vec{b_2}$ (indicated by "+" in the figure) can still have a larger slope than the sum of the two vectors $\vec{r_1} + \vec{r_2}$, as shown in the example.
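The vector picture can be checked with the Lisa and Bart figures from the earlier example. This is only a sketch: each week is encoded as a vector (articles edited, articles improved), and the assignment of Lisa to the "smaller slope" role is a choice made here for illustration, not something taken from the figure.

```python
def slope(vector):
    """Slope of a vector (q, p), i.e. the success rate p / q."""
    q, p = vector
    return p / q

def add(u, v):
    """Component-wise vector sum (the parallelogram rule)."""
    return (u[0] + v[0], u[1] + v[1])

# (articles edited, articles improved) for week 1 and week 2.
lisa_1, lisa_2 = (100, 60), (10, 1)
bart_1, bart_2 = (10, 9), (100, 30)

# Lisa's vector has the smaller slope in each individual week...
print(slope(lisa_1) < slope(bart_1), slope(lisa_2) < slope(bart_2))

# ...but the sum of her vectors has the larger slope.
lisa_sum = add(lisa_1, lisa_2)
bart_sum = add(bart_1, bart_2)
print(lisa_sum, bart_sum, slope(lisa_sum) > slope(bart_sum))
```

The sum is pulled toward the longer of the two vectors, which is why a long, moderately steep vector plus a short, shallow one can out-slope a short, steep vector plus a long, shallow one.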
For a brief history of the origins of the paradox see the entries "Simpson's Paradox and Spurious Correlation" in