Regression toward the mean, in statistics, is the phenomenon whereby members of a population with extreme values on a given measure for one observation will, for purely statistical reasons, probably give less extreme measurements on other occasions when they are observed.
Consider a simple example: A class of students takes a 100-item true/false test on a subject. Suppose that all students choose randomly on all questions. Then, each student's score would be a single instance of one of a set of independent and identically distributed random variables, with a mean of 50. Naturally, some students will score substantially above 50 and some substantially below 50 just by chance. If one takes only the top-scoring 10% of the students and gives them a second test on which they again guess on all items, the mean score would again be expected to be close to 50. Thus, the mean of these students would "regress" all the way back to the mean of all students who took the original test. No matter what a student scores on the original test, the best prediction of their score on the second test is 50.
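The guessing scenario above is easy to simulate. The sketch below (sample sizes and the random seed are arbitrary choices) selects the top-scoring 10% of guessing students and retests them:

```python
import random

random.seed(0)

N_STUDENTS = 1000
N_ITEMS = 100

def guess_score():
    """Score of a student who guesses on every true/false item."""
    return sum(random.random() < 0.5 for _ in range(N_ITEMS))

first = [guess_score() for _ in range(N_STUDENTS)]

# Select the top-scoring 10% on the first test (ties may add a few extra).
cutoff = sorted(first, reverse=True)[N_STUDENTS // 10 - 1]
top = [s for s in first if s >= cutoff]

# The same students guess again on a second test.
retest = [guess_score() for _ in top]

print(sum(top) / len(top))        # well above 50, by construction
print(sum(retest) / len(retest))  # back near 50
```

The selected group's first-test mean is high only because the group was selected on that very score; on the retest nothing distinguishes them from the rest of the class.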
If there were no luck or random guessing involved in the answers supplied by students to the test questions, then all students would score the same on the second test as they scored on the original test, and there would be no regression toward the mean.
Most realistic situations fall between these two extremes: for example, one might consider exam scores as a combination of skill and luck. In this case, the subset of students scoring above average would have more lucky students than unlucky ones. On a retest of this subset, the lucky students would have average luck, which would be worse than before, and therefore lower scores.
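The skill-plus-luck description above can also be simulated. In the sketch below, the particular skill and luck distributions are hypothetical; the point is only that luck is redrawn on every sitting while skill persists:

```python
import random

random.seed(1)

N = 10_000

# Hypothetical model: each student's score = persistent skill + fresh luck.
skill = [random.gauss(70, 10) for _ in range(N)]

def exam(sk):
    return sk + random.gauss(0, 10)  # luck is redrawn on every sitting

first = [exam(s) for s in skill]
mean_first = sum(first) / N

# Students who scored above average the first time...
above = [(s, f) for s, f in zip(skill, first) if f > mean_first]

# ...had, on average, positive luck (their scores exceed their skill):
avg_luck = sum(f - s for s, f in above) / len(above)

# On a retest their luck averages out to zero, so the group's mean drops:
retest = [exam(s) for s, _ in above]
drop = sum(f for _, f in above) / len(above) - sum(retest) / len(retest)
print(avg_luck)  # positive
print(drop)      # roughly the same size as avg_luck
```

The drop in the retest mean matches the average luck the selected group enjoyed the first time, which is exactly the regression effect.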
The earliest example of regression toward the mean is perhaps Galton's diagram in his article in the Journal of the Anthropological Institute, vol. 15 (1886), p. 248. It concerned the "Rate of Regression in Hereditary Stature" and compared children's heights with their mid-parent heights. (All heights are expressed as deviations from the median.) He summarised these results in the diagram:
- When mid-parents are taller than mediocrity (by which he means the median), their children tend to be shorter than they
- When mid-parents are shorter than mediocrity, their children tend to be taller than they
which is as good a summary of the "regression effect" as we are likely to get.
Galton also drew the first regression line. This was a plot of seed weights and was presented at a Royal Institution lecture in 1877. Galton had seven sets of sweet pea seeds labeled K to Q, and in each packet the seeds were of the same weight. He chose sweet peas on the advice of his cousin Charles Darwin and the botanist Joseph Dalton Hooker, as sweet peas tend not to self-fertilise and the seed weight varies little with humidity. He distributed these packets to a group of friends throughout Great Britain who planted them. At the end of the growing season the plants were uprooted and returned to Galton. The seeds were distributed because when Galton had tried this experiment himself at Kew Gardens in 1874, the crop had failed.
He found that the weights of the offspring seeds were normally distributed, like their parents, and that if he plotted the mean diameter of the offspring seeds against the mean diameter of their parents he could draw a straight line through the points — the first regression line. He also found on this plot that the mean size of the offspring seeds tended toward the overall mean size. He initially referred to the slope of this line as the "coefficient of reversion". Once he discovered that this effect was not a heritable property but the result of his manipulations of the data, he changed the name to the "coefficient of regression". This result was important because it appeared to conflict with the then-current thinking on evolution and natural selection. He went on to do extensive work in quantitative genetics and in 1888 coined the term "co-relation" and used the now familiar symbol "r" for this value.
In additional work he investigated geniuses in various fields and noted that their children, while typically gifted, were almost invariably closer to the average than their exceptional parents. He later described the same effect more numerically by comparing fathers' heights to their sons' heights. Again, the heights of sons both of unusually tall fathers and of unusually short fathers were typically closer to the mean height than their fathers' heights.
It is important to realize that regression toward the mean is unrelated to the progression of time: in Galton's example, the fathers of exceptionally tall sons also tend to be closer to the mean than their sons. It is an artifact of choosing a non-representative sample in one variable and examining the properties of this sample in another.
The original formulation of regression toward the mean involves two correlated measurements of a single trait with the same reliability. This restriction is not necessary, however: the predicting and predicted variables need not measure an identical underlying trait. The implicit assumption that is required is that the standard deviations of the predicting and predicted variables are comparable, or that the variables have been transformed (for example, standardized) so that they are.
A later formulation of regression toward the mean attributes it to measurement error in the predicting variable, which attenuates the regression coefficient. This interpretation is not necessary either: in Galton's original case, for example, measurement error in height can be ignored.
Another way to understand regression toward the mean is the phrase "nowhere to go but up/down": if you look at any exceptional group, some related group will likely be less exceptional, due to regression toward the mean. For example, if you looked at the families with below-average incomes today and compared them to how they were doing 5 years ago, you would find that they were likely doing better earlier, and conclude that things are getting worse. Likewise, if you looked at the families with below-average incomes 5 years ago, you would likely find that they are doing better now, and conclude that things are getting better. This seeming contradiction is an artifact of regression toward the mean. Since below-average families can only remain in the below-average group or move to the above-average group, any change pulls the statistic toward the mean; selecting a non-representative sample biases the selected measurement away from the mean relative to the other.
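A small simulation makes the seeming contradiction concrete. In this sketch the income model is hypothetical (a persistent family-level component plus independent year-to-year noise); the same below-average selection is applied once to today's incomes and once to the earlier incomes:

```python
import random

random.seed(2)

N = 10_000

# Hypothetical model: persistent income level plus independent yearly noise.
base = [random.gauss(50, 10) for _ in range(N)]
year0 = [b + random.gauss(0, 10) for b in base]  # income 5 years ago
year5 = [b + random.gauss(0, 10) for b in base]  # income today

mean0 = sum(year0) / N
mean5 = sum(year5) / N

# Families below average *today* look like they were doing better before:
below_now = [(a, b) for a, b in zip(year0, year5) if b < mean5]
then_vs_now = (sum(a for a, _ in below_now) / len(below_now)
               - sum(b for _, b in below_now) / len(below_now))

# Families below average *5 years ago* look like they are doing better now:
below_then = [(a, b) for a, b in zip(year0, year5) if a < mean0]
now_vs_then = (sum(b for _, b in below_then) / len(below_then)
               - sum(a for a, _ in below_then) / len(below_then))

print(then_vs_now)  # positive: "things got worse"
print(now_vs_then)  # also positive: "things got better"
```

Both differences come out positive because, whichever year is used for selection, the selected-on measurement is more extreme than the other measurement of the same families.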
If random variables X and Y have standard deviations σ_X and σ_Y and correlation coefficient r, then the slope b of the regression line obtained via the least squares method is given by

b = r σ_Y / σ_X.

Since the population variance remains unchanged in the next generation and the midparent variance is half that of the population (σ_X = σ_Y / √2), the slope now becomes b = r√2. Note that regression toward the mean is more pronounced the less the two variables are correlated, i.e. the smaller |r| is. If |r| = 1 (and σ_X = σ_Y), then regression toward the mean will not occur, as the slope b = 1.
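The identity between the least-squares slope and r σ_Y/σ_X can be checked numerically. The sketch below uses a made-up linear model with noise; the coefficients are arbitrary:

```python
import random

random.seed(3)

n = 50_000
x = [random.gauss(0, 2) for _ in range(n)]
# Hypothetical linear model with noise, so that 0 < r < 1.
y = [0.5 * xi + random.gauss(0, 1) for xi in x]

mx, my = sum(x) / n, sum(y) / n
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

slope_ls = sxy / sxx                    # least-squares slope b
r = sxy / (sxx * syy) ** 0.5            # correlation coefficient
slope_formula = r * (syy / sxx) ** 0.5  # b = r * sigma_Y / sigma_X

print(slope_ls, slope_formula)  # the two agree
```

Algebraically the two expressions are identical, so they agree up to floating-point rounding; the sample slope is also close to the true coefficient 0.5 used to generate the data.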