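The statistic is defined as the sum of the products of the standard scores of the two measures divided by the degrees of freedom. If the data comes from a sample, then
$$r = \frac{\sum z_x z_y}{n - 1}$$
where
$$z_x = \frac{x - \bar{x}}{s_x}, \quad \bar{x}, \quad s_x$$
are the standard score, sample mean, and sample standard deviation of x (calculated using n − 1 in the denominator), with the corresponding quantities for y defined in the same way.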
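If the data comes from a population, then
$$\rho = \frac{\sum z_x z_y}{n}$$
where
$$z_x = \frac{x - \mu_x}{\sigma_x}, \quad \mu_x, \quad \sigma_x$$
are the standard score, population mean, and population standard deviation of x (calculated using n in the denominator), with the corresponding quantities for y defined in the same way.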
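The standard-score definition can be sketched in a few lines of Python. This is only an illustration: the function name `pearson_from_z`, the `sample` flag, and the data values are made up for the example and are not part of the definition.

```python
import math

def pearson_from_z(xs, ys, sample=True):
    """Sum of products of standard scores divided by the degrees of freedom."""
    n = len(xs)
    # Degrees of freedom: n - 1 for a sample, n for a population.
    dof = n - 1 if sample else n
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The standard deviations use the same denominator as the final division.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / dof)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / dof)
    zx = [(x - mean_x) / sd_x for x in xs]
    zy = [(y - mean_y) / sd_y for y in ys]
    return sum(a * b for a, b in zip(zx, zy)) / dof

# Illustrative data (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r_sample = pearson_from_z(xs, ys, sample=True)
r_population = pearson_from_z(xs, ys, sample=False)
```

When the standard deviations and the final division use the same denominator, the two conventions give the same number for a given data set, because the factors of n − 1 (or n) cancel.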
The result obtained is equivalent to dividing the covariance between the two variables by the product of their standard deviations.
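The equivalence can be written out for the sample case as follows. The symbol $s_{xy}$ for the sample covariance is introduced here only for illustration, and $\bar{x}$, $\bar{y}$, $s_x$, $s_y$ denote the sample means and sample standard deviations:

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
      \left( \frac{x_i - \bar{x}}{s_x} \right)
      \left( \frac{y_i - \bar{y}}{s_y} \right)
  = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{s_x \, s_y}
  = \frac{s_{xy}}{s_x \, s_y}
```

The population case follows in the same way with n in place of n − 1 and the population mean and standard deviation in place of the sample ones.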
The coefficient ranges from −1 to 1. A value of 1 shows that a linear equation describes the relationship perfectly and positively, with all data points lying on the same line and with Y increasing with X. A value of −1 shows that all data points lie on a single line but that Y increases as X decreases. A value of 0 shows that there is no linear relationship between the variables, so a linear model is inappropriate.
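The three reference values can be illustrated with a small sketch. The helper `pearson_r` and the data sets are hypothetical, chosen only so that the result is easy to check by hand. The last data set lies exactly on a parabola, which shows that a coefficient of 0 rules out a linear relationship but not a nonlinear one.

```python
import math

def pearson_r(xs, ys):
    """Pearson coefficient computed from sums of deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
r_increasing = pearson_r(xs, [2 * x + 1 for x in xs])   # points on a rising line
r_decreasing = pearson_r(xs, [-3 * x + 4 for x in xs])  # points on a falling line
r_parabola = pearson_r(xs, [x ** 2 for x in xs])        # exact but nonlinear relationship
```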
The linear equation that best describes the relationship between X and Y can be found by linear regression. This equation can be used to "predict" the value of one measurement from knowledge of the other. That is, for each value of X, the equation calculates a value that is the best estimate of the values of Y corresponding to that value of X. We denote this predicted variable by Y′.
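As a minimal sketch, the regression line can be obtained by ordinary least squares, which is assumed here as the fitting method. The function name `fit_line` and the data are illustrative only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
slope, intercept = fit_line(xs, ys)
# Predicted values: one estimate of Y for each observed X.
predicted = [intercept + slope * x for x in xs]
```

For this data the slope is 0.6 and the intercept is 2.2, so the predicted value at X = 3 equals the mean of Y, which is 4.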
Any value of Y can therefore be defined as the sum of Y′ and the difference between Y and Y′:
$$Y = Y' + (Y - Y')$$
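The variance of Y is equal to the sum of the variances of the two components of Y:
$$s_y^2 = s_{y'}^2 + s_{y.x}^2$$
where $s_{y'}^2$ is the variance of the predicted values Y′ and $s_{y.x}^2$ is the variance of the residuals Y − Y′.
Since the coefficient of determination implies that $s_{y.x}^2 = s_y^2(1 - r^2)$, we can derive the identity
$$r^2 = \frac{s_{y'}^2}{s_y^2}$$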
The square of r is conventionally used as a measure of the association between X and Y. For example, if $r^2$ is 0.90, then 90% of the variance of Y can be "accounted for" by changes in X and the linear relationship between X and Y.
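The variance decomposition and the interpretation of the squared coefficient can be checked numerically. The sketch below assumes an ordinary least-squares fit and uses the same n − 1 denominator for every variance, which is what makes the identities hold exactly. The data and variable names are illustrative only.

```python
# Illustrative data (hypothetical values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Least-squares line, predicted values, and residuals (Y minus predicted Y).
slope = sxy / sxx
intercept = mean_y - slope * mean_x
predicted = [intercept + slope * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

def variance(values):
    """Variance with n - 1 in the denominator (assumed for all three terms)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

var_y = variance(ys)
var_predicted = variance(predicted)
var_residual = variance(residuals)
r_squared = sxy ** 2 / (sxx * syy)
```

For this data the variance of Y is 1.5, which splits into 0.9 for the predicted values and 0.6 for the residuals, so the squared coefficient is 0.9 / 1.5 = 0.6: 60% of the variance of Y is accounted for by the linear relationship with X.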