Linear regression

In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a function fitted by least squares, called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line. The fitted results are then subject to statistical analysis.

Definitions

The data consist of n values x_{i,1}, \ldots, x_{i,n} for each of the m independent variables (explanatory variables) x_i (i = 1, \ldots, m), and n values y_1, \ldots, y_n of the dependent variable y (response variable). The independent variables are also called regressors, exogenous variables, covariates, input variables or predictor variables.

In general there are m parameters to be determined, \beta_1, \ldots, \beta_m. The model is a linear combination of these parameters,

y_i = \sum_{j=1}^{m} X_{ij} \beta_j + \varepsilon_i.

Here, X typically includes a constant, that is, a column that does not vary across observations, together with the independent variables or functions of the independent variables, and \varepsilon_i denotes the error term. Models that do not conform to this specification may be treated by nonlinear regression.

In simple linear regression the data model represents a straight line and can be written as

y_i = \beta_1 + x_i \beta_2 + \varepsilon_i.

Here m = 2, the parameters of the model are \beta_1 (intercept) and \beta_2 (slope), and the coefficients are X_{i1} = 1 and X_{i2} = x_i.

A linear regression model need not be a linear function of the independent variable: linear in this context means that the conditional mean of y is linear in the parameters \beta. For example, the model y = \beta_1 + \beta_2 x + \beta_3 x^2 + \varepsilon is linear in the parameters \beta_1, \beta_2 and \beta_3, but it is not linear in x, as X_{i3} = x_i^2 is a nonlinear function of x_i. An illustration of this model is shown in the example below.

If there is any linear dependence amongst the columns of X, then the parameters β cannot be estimated by least squares unless β is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of β may still be uniquely estimable in such cases. For example, the model

y_i = \beta_1 + x_i \beta_2 + 2 x_i \beta_3 + \varepsilon_i

cannot be solved for \beta_2 and \beta_3 independently, as the matrix of coefficients, X, has the reduced rank 2. In this case the model can be rewritten as

y_i = \beta_1 + x_i (\beta_2 + 2\beta_3) + \varepsilon_i

and can be solved to give values for \hat\beta_1 and the composite quantity \widehat{\beta_2 + 2\beta_3}.
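The rank deficiency described above can be checked numerically. The following is a minimal sketch in plain Python; the x values and the helper names gram and det3 are made up for illustration. It shows that X^T X has determinant zero when the columns are 1, x and 2x, and becomes invertible once the two dependent columns are merged.

```python
# Columns of X for the model y = b1 + x*b2 + 2x*b3 are [1, x, 2x].
# The x values are hypothetical, chosen only for illustration.
xs = [1, 2, 3, 4]
X = [[1, x, 2 * x] for x in xs]


def gram(rows):
    """Return X^T X for a design matrix given as a list of rows."""
    m = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]


def det3(a):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))


# Third column is twice the second, so X^T X is singular.
singular_det = det3(gram(X))

# Rewritten model with columns [1, x]; its single slope is beta_2 + 2*beta_3.
G2 = gram([[1, x] for x in xs])
reduced_det = G2[0][0] * G2[1][1] - G2[0][1] * G2[1][0]

print(singular_det, reduced_det)
```

Integer inputs are used so that the zero determinant is exact; with floating-point data a rank check would need a tolerance.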

Classical assumptions for linear regression include the assumptions that the sample is selected at random from the population of interest, that the dependent variable is continuous on the real line, and that the error terms follow identical and independent normal distributions, that is, that the errors are i.i.d. and Gaussian. These assumptions imply that the error term does not statistically depend on the values of the independent variables, that is, that \varepsilon is statistically independent of the predictor variables X. This article adopts these assumptions unless otherwise stated. In more advanced treatments all of these assumptions may be relaxed. In particular, the assumption that the error terms are normally distributed matters little unless the sample is very small, because central limit theorems imply that, so long as the error terms have finite variance and are not too strongly correlated, the parameter estimates will be approximately normally distributed even when the underlying errors are not.

Under these assumptions, an equivalent formulation of simple linear regression that shows it explicitly as a model of conditional expectation is

\operatorname{E}(y \mid x) = \alpha + \beta x.

The conditional distribution of y given x is a linear transformation of the distribution of the error term. This expression follows from the assumption that the mean of \varepsilon is zero conditional on x.

Notation and naming conventions

  • Scalars and vectors are denoted by lower case letters.
  • Matrices are denoted by upper case letters.
  • Parameters are denoted by Greek letters.
  • Vectors and matrices are denoted by bold letters. Using the symbols defined above, for example, the vector \mathbf{y} is the column vector of y_1, y_2, \ldots, y_n:

\mathbf{y} =
\begin{pmatrix}
 y_1 \\
 y_2 \\
 \vdots \\
 y_n
\end{pmatrix}.

  • A parameter with a hat, such as \hat\beta_j, denotes a parameter estimator.

Least-squares analysis

The first objective of regression analysis is to best-fit the data by adjusting the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is a widely used and analytically convenient one. The objective function, S, is defined as the sum of squared residuals r_i,

S = \sum_{i=1}^{n} r_i^2 \quad \bigg[\text{in matrix notation: } S = \mathbf{r^T r}\bigg],

where each residual is the difference between the observed value and the value calculated by the model:

r_i = y_i - \sum_{j=1}^{m} X_{ij} \hat\beta_j \quad \bigg[\text{in matrix notation: } \mathbf{r} = \mathbf{y} - \mathbf{X} \boldsymbol{\hat\beta}\bigg].

The best fit is obtained when S, the sum of squared residuals, is minimized. Subject to certain conditions, the parameter estimators then have minimum variance among linear unbiased estimators (Gauss–Markov theorem) and may also represent a maximum likelihood solution to the optimization problem.

From the theory of linear least squares, the parameter estimators are found by solving the normal equations

\sum_{i=1}^{n} \sum_{k=1}^{m} X_{ij} X_{ik} \hat\beta_k = \sum_{i=1}^{n} X_{ij} y_i, \quad j = 1, \ldots, m.

In matrix notation, these equations are written as

\mathbf{(X^T X)} \boldsymbol{\hat\beta} = \mathbf{X^T y},

and thus, when the matrix \mathbf{X^T X} is non-singular,

\boldsymbol{\hat\beta} = \mathbf{(X^T X)^{-1} X^T y}.

For straight-line fitting, explicit formulas are given in the section "Univariate linear case" below.
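As an illustration of the normal equations, the sketch below forms X^T X and X^T y and solves the resulting system by Gaussian elimination with partial pivoting. It is written in plain Python to stay self-contained; the function names and the small data set are hypothetical, and in practice a numerical library routine would normally be used instead.

```python
def normal_equations(X, y):
    """Return (X^T X, X^T y) for a design matrix X (list of rows) and responses y."""
    m = len(X[0])
    xtx = [[sum(r[j] * r[k] for r in X) for k in range(m)] for j in range(m)]
    xty = [sum(r[j] * yi for r, yi in zip(X, y)) for j in range(m)]
    return xtx, xty


def solve(a, b):
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:  # crude absolute tolerance (assumption)
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(aug[i][c] * beta[c] for c in range(i + 1, n))
        beta[i] = (aug[i][n] - s) / aug[i][i]
    return beta


def least_squares(X, y):
    """Least-squares estimate obtained from the normal equations."""
    xtx, xty = normal_equations(X, y)
    return solve(xtx, xty)


# Hypothetical data lying exactly on y = 1 + 2x, so the fit should recover (1, 2).
xs = [0.0, 1.0, 2.0, 3.0]
X = [[1.0, x] for x in xs]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = least_squares(X, y)
print(beta_hat)
```

Forming X^T X explicitly squares the condition number of the problem, so for badly conditioned design matrices a QR-based or SVD-based solver is usually preferred over this direct approach.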

Regression inference

The estimates can be used to test various hypotheses.

Denote by \sigma^2 the variance of the error term \varepsilon (recall the assumption that \varepsilon_i \sim N(0, \sigma^2) for every i = 1, \ldots, n). An unbiased estimate of \sigma^2 is given by

\hat\sigma^2 = \frac{S}{n-m}.

The relation between the estimate and the true value is

\hat\sigma^2 \cdot \frac{n-m}{\sigma^2} \sim \chi_{n-m}^2,

where \chi_{n-m}^2 denotes the chi-square distribution with n - m degrees of freedom.

The solution to the normal equations can be written as

\boldsymbol{\hat\beta} = \mathbf{(X^T X)^{-1} X^T y}.

This shows that the parameter estimators are linear combinations of the dependent variable. It follows that, if the observational errors are normally distributed, the parameter estimators will follow a joint normal distribution. Under the assumptions here, the estimated parameter vector is exactly distributed as

\boldsymbol{\hat\beta} \sim N\left(\boldsymbol\beta, \sigma^2 \mathbf{(X^T X)^{-1}}\right),

where N(\cdot) denotes the multivariate normal distribution.

The standard error of a parameter estimator is given by

\hat\sigma_j = \sqrt{\frac{S}{n-m} \left[\mathbf{(X^T X)}^{-1}\right]_{jj}}.

The 100(1-\alpha)% confidence interval for the parameter \beta_j is computed as

\hat\beta_j \pm t_{\frac{\alpha}{2}, n-m} \hat\sigma_j.

The residuals can be expressed as

\mathbf{\hat r} = \mathbf{y} - \mathbf{X} \boldsymbol{\hat\beta} = \mathbf{y - X (X^T X)^{-1} X^T y}.

The matrix \mathbf{X (X^T X)^{-1} X^T} is known as the hat matrix and has the useful property that it is idempotent. Using this property it can be shown that, if the errors are normally distributed, the residuals follow a normal distribution with covariance matrix \sigma^2 \left(\mathbf{I - X (X^T X)^{-1} X^T}\right). Studentized residuals are useful in testing for outliers.

Given a value of the independent variable, x_d, the predicted response is calculated as

y_d = \sum_{j=1}^{m} X_{dj} \hat\beta_j.

Writing the elements X_{dj}, j = 1, \ldots, m, as the vector \mathbf{z}, the 100(1-\alpha)% mean response confidence interval for the prediction is given, using error propagation theory, by

\mathbf{z^T} \boldsymbol{\hat\beta} \pm t_{\frac{\alpha}{2}, n-m} \hat\sigma \sqrt{\mathbf{z^T (X^T X)^{-1} z}}.

The 100(1-\alpha)% predicted response confidence interval (prediction interval) for a new observation is given by

\mathbf{z^T} \boldsymbol{\hat\beta} \pm t_{\frac{\alpha}{2}, n-m} \hat\sigma \sqrt{1 + \mathbf{z^T (X^T X)^{-1} z}}.

Univariate linear case

In the case that the formula to be fitted is a straight line, y = \alpha + \beta x, the normal equations are

\begin{array}{lcl}
n \alpha + \beta \sum x_i &=& \sum y_i \\
\alpha \sum x_i + \beta \sum x_i^2 &=& \sum x_i y_i
\end{array}

where all summations are from i = 1 to i = n. In this case, \beta is called the regression coefficient.

Thence, by Cramer's rule,

\hat\beta = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\Delta} = \frac{\sum (x_i - \bar x)(y_i - \bar y)}{\sum (x_i - \bar x)^2},

\hat\alpha = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{\Delta} = \bar y - \bar x \hat\beta,

S = \sum (y_i - \hat y_i)^2 = \sum y_i^2 - \frac{n \left(\sum x_i y_i\right)^2 + \left(\sum y_i\right)^2 \sum x_i^2 - 2 \sum x_i \sum y_i \sum x_i y_i}{\Delta},

\hat\sigma^2 = \frac{S}{n-2},

where

\Delta = n \sum x_i^2 - \left(\sum x_i\right)^2.

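The covariance matrix of the estimators (\hat\alpha, \hat\beta), that is, \sigma^2 \mathbf{(X^T X)^{-1}}, is

\frac{\sigma^2}{\Delta} \begin{pmatrix}
 \sum x_i^2 & -\sum x_i \\
 -\sum x_i & n
\end{pmatrix}.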

The mean response confidence interval is given by

y_d = (\hat\alpha + \hat\beta x_d) \pm t_{\frac{\alpha}{2}, n-2} \hat\sigma \sqrt{\frac{1}{n} + \frac{(x_d - \bar x)^2}{\sum (x_i - \bar x)^2}}.

The predicted response confidence interval is given by

y_d = (\hat\alpha + \hat\beta x_d) \pm t_{\frac{\alpha}{2}, n-2} \hat\sigma \sqrt{1 + \frac{1}{n} + \frac{(x_d - \bar x)^2}{\sum (x_i - \bar x)^2}}.
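The closed-form expressions of this section translate directly into code. The sketch below is one possible plain-Python rendering; the function names, the five data points and the way the t quantile is passed in are all assumptions made for illustration. The critical value 3.182 is the two-sided 95% Student t quantile for n - 2 = 3 degrees of freedom taken from a standard table, since the Python standard library has no t distribution.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = alpha + beta*x using the closed-form sums."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    delta = n * sxx - sx ** 2
    beta = (n * sxy - sx * sy) / delta
    alpha = (sxx * sy - sx * sxy) / delta
    s = sum((yi - (alpha + beta * xi)) ** 2 for xi, yi in zip(x, y))
    sigma2 = s / (n - 2)  # unbiased estimate of the error variance
    return {
        "alpha": alpha, "beta": beta, "S": s, "sigma2": sigma2,
        # Standard errors from the diagonal of sigma2 * (X^T X)^{-1}.
        "se_alpha": math.sqrt(sigma2 * sxx / delta),
        "se_beta": math.sqrt(sigma2 * n / delta),
    }


def interval(fit, x, x_d, t_crit, predict=False):
    """Mean response interval at x_d, or prediction interval if predict=True."""
    n = len(x)
    xbar = sum(x) / n
    sxx_centered = sum((xi - xbar) ** 2 for xi in x)
    spread = 1.0 / n + (x_d - xbar) ** 2 / sxx_centered
    if predict:
        spread += 1.0  # extra variance of a single new observation
    center = fit["alpha"] + fit["beta"] * x_d
    half = t_crit * math.sqrt(fit["sigma2"] * spread)
    return center - half, center + half


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
fit = fit_line(x, y)

T_CRIT = 3.182  # t quantile for 95% two-sided, 3 degrees of freedom (from tables)
mean_lo, mean_hi = interval(fit, x, 3.0, T_CRIT)
pred_lo, pred_hi = interval(fit, x, 3.0, T_CRIT, predict=True)
print(fit, (mean_lo, mean_hi), (pred_lo, pred_hi))
```

The prediction interval is always wider than the mean response interval at the same x_d, because it adds the variance of a new observation to the uncertainty of the fitted line.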

Analysis of variance

In analysis of variance (ANOVA), the total sum of squares is split into two or more components.

The "total (corrected) sum of squares" is

\text{SST} = \sum_{i=1}^{n} (y_i - \bar y)^2,

where

\bar y = \frac{1}{n} \sum_i y_i

("corrected" means that \bar y has been subtracted from each y-value).

The total sum of squares is partitioned as the sum of the "regression sum of squares" SSReg (also called the "explained sum of squares"; sometimes abbreviated RSS, an abbreviation that other texts use for the residual sum of squares) and the "error sum of squares" SSE, which is the sum of squares of residuals.

The regression sum of squares is

\text{SSReg} = \sum \left(\hat y_i - \bar y\right)^2 = \boldsymbol{\hat\beta}^T \mathbf{X}^T \mathbf{y} - \frac{1}{n} \left(\mathbf{y^T u u^T y}\right),

where \mathbf{u} is an n-by-1 vector in which each element is 1. Note that

\mathbf{y^T u} = \mathbf{u^T y} = \sum_i y_i,

and

\frac{1}{n} \mathbf{y^T u u^T y} = \frac{1}{n} \left(\sum_i y_i\right)^2.

The error (or "unexplained") sum of squares SSE, which is the sum of squares of the residuals, is given by

\text{SSE} = \sum_i \left(y_i - \hat y_i\right)^2 = \mathbf{y^T y} - \boldsymbol{\hat\beta}^T \mathbf{X^T y}.

The total sum of squares SST is

\text{SST} = \sum_i \left(y_i - \bar y\right)^2 = \mathbf{y^T y} - \frac{1}{n} \left(\mathbf{y^T u u^T y}\right) = \text{SSReg} + \text{SSE}.

The coefficient of determination, R^2, is then given as

R^2 = \frac{\text{SSReg}}{\text{SST}} = 1 - \frac{\text{SSE}}{\text{SST}}.

If the errors are independent and normally distributed with expected value 0 and they all have the same variance, then under the null hypothesis that all elements of \boldsymbol\beta except the constant are zero, the statistic

\frac{R^2 / (m-1)}{(1 - R^2) / (n-m)}

follows an F-distribution with (m-1) and (n-m) degrees of freedom. If that statistic is too large, then one rejects the null hypothesis. How large is too large depends on the level of the test, which is the tolerated probability of type I error; see statistical significance.
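The partition of the sums of squares and the F statistic can be computed in a few lines. This is a minimal sketch with a hypothetical helper named anova; the observed values are made-up illustration data, and the fitted values are those of the least-squares line 2.2 + 0.6x through the points x = 1, ..., 5, so that SST = SSReg + SSE holds.

```python
def anova(y, y_hat, m):
    """Sums of squares, R^2 and the F statistic for a fit with m parameters.

    Assumes y_hat comes from a least-squares fit that includes a constant term;
    otherwise SST need not equal SSReg + SSE.
    """
    n = len(y)
    ybar = sum(y) / n
    sst = sum((yi - ybar) ** 2 for yi in y)                # total (corrected)
    ssreg = sum((fi - ybar) ** 2 for fi in y_hat)          # explained
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # residual
    r2 = 1.0 - sse / sst
    f_stat = (r2 / (m - 1)) / ((1.0 - r2) / (n - m))
    return {"SST": sst, "SSReg": ssreg, "SSE": sse, "R2": r2, "F": f_stat}


# Hypothetical observations and the fitted values of the line 2.2 + 0.6 x.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]
table = anova(y, y_hat, m=2)
print(table)
```

The F value would then be compared with the critical value of the F-distribution with (m-1) and (n-m) degrees of freedom at the chosen test level; that lookup is left out here because it requires statistical tables or a library.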

Example

To illustrate the various goals of regression, we give an example. The following data set gives the average heights and weights for American women aged 30-39 (source: The World Almanac and Book of Facts, 1975).
Height/m   Weight/kg
1.47       52.21
1.50       53.12
1.52       54.48
1.55       55.84
1.57       57.20
1.60       58.57
1.63       59.93
1.65       61.29
1.68       63.11
1.70       64.47
1.73       66.28
1.75       68.10
1.78       69.92
1.80       72.19
1.83       74.46
A plot of weight against height shows curvature that a straight line does not capture, so the regression is performed by modeling the data with a parabola,

y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon,

where the dependent variable, y, is weight and the independent variable, x, is height. Place the coefficients 1, x_i and x_i^2 of the parameters for the ith observation in the ith row of the matrix X:

X =
\begin{bmatrix}
1 & 1.47 & 2.16 \\
1 & 1.50 & 2.25 \\
1 & 1.52 & 2.31 \\
1 & 1.55 & 2.40 \\
1 & 1.57 & 2.46 \\
1 & 1.60 & 2.56 \\
1 & 1.63 & 2.66 \\
1 & 1.65 & 2.72 \\
1 & 1.68 & 2.82 \\
1 & 1.70 & 2.89 \\
1 & 1.73 & 2.99 \\
1 & 1.75 & 3.06 \\
1 & 1.78 & 3.17 \\
1 & 1.80 & 3.24 \\
1 & 1.83 & 3.35
\end{bmatrix}

The values of the parameters are found by solving the normal equations

\mathbf{(X^T X)} \boldsymbol{\hat\beta} = \mathbf{X^T y}.

Element ij of the normal equation matrix, \mathbf{X^T X}, is formed by summing the products of column i and column j of X:

\left(\mathbf{X^T X}\right)_{ij} = \sum_{k=1}^{15} X_{ki} X_{kj}.

Element i of the right-hand side vector \mathbf{X^T y} is formed by summing the products of column i of X with the column of dependent variable values:

\left(\mathbf{X^T y}\right)_i = \sum_{k=1}^{15} X_{ki} y_k.

Thus, the normal equations are

\begin{bmatrix}
15 & 24.76 & 41.05 \\
24.76 & 41.05 & 68.37 \\
41.05 & 68.37 & 114.35
\end{bmatrix}
\begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{bmatrix}
=
\begin{bmatrix} 931 \\ 1548 \\ 2586 \end{bmatrix}

with the solution (value \pm standard deviation)

\hat\beta_0 = 129 \pm 16
\hat\beta_1 = -143 \pm 20
\hat\beta_2 = 62 \pm 6

The calculated values are given by

y^{calc}_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\beta_2 x_i^2.

The observed and calculated data are plotted together and the residuals, y_i - y^{calc}_i, are calculated and plotted. Standard deviations are calculated using the sum of squares, S = 0.76.

The confidence intervals are computed using:

[\hat\beta_j - \hat\sigma_j t_{n-m; 1-\frac{\alpha}{2}}; \hat\beta_j + \hat\sigma_j t_{n-m; 1-\frac{\alpha}{2}}]

with \alpha = 5% and n - m = 15 - 3 = 12 degrees of freedom, so that t_{n-m; 1-\frac{\alpha}{2}} = 2.2. Therefore, the 95% confidence intervals are:

\beta_0 \in [92.9, 164.7]

\beta_1 \in [-186.8, -99.5]

\beta_2 \in [48.7, 75.2]
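The example can be reproduced with a short script. The sketch below uses the height and weight data of the table, builds the design matrix with rows (1, x_i, x_i^2), forms the normal equations and solves them by Gauss-Jordan elimination with partial pivoting. Plain Python is used only to keep the sketch self-contained; the variable names are illustrative, and the standard errors are not computed here.

```python
# Height (m) and weight (kg) data from the example.
heights = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weights = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
           63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

# Design matrix with rows (1, x_i, x_i^2).
X = [[1.0, x, x * x] for x in heights]
m = 3

# Normal equations: (X^T X) beta = X^T y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(m)] for i in range(m)]
xty = [sum(r[i] * w for r, w in zip(X, weights)) for i in range(m)]

# Gauss-Jordan elimination with partial pivoting on the augmented matrix.
aug = [row[:] + [rhs] for row, rhs in zip(xtx, xty)]
for col in range(m):
    pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
    aug[col], aug[pivot] = aug[pivot], aug[col]
    pivot_value = aug[col][col]
    aug[col] = [v / pivot_value for v in aug[col]]
    for r in range(m):
        if r != col:
            factor = aug[r][col]
            aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
beta_hat = [aug[i][m] for i in range(m)]

# Calculated values and the sum of squared residuals S.
fitted = [sum(b * v for b, v in zip(beta_hat, row)) for row in X]
S = sum((w - f) ** 2 for w, f in zip(weights, fitted))

print([round(b, 1) for b in beta_hat], round(S, 2))
```

The columns 1, x and x^2 are nearly collinear over this narrow range of heights, so the normal equations are poorly conditioned; the estimates should agree with the rounded values quoted in the example, but a QR-based solver or centered heights would be a safer choice for serious work.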

Examining results of regression models

Checking model assumptions

Some of the model assumptions can be evaluated by calculating the residuals and plotting or otherwise analyzing them. The following plots can be constructed to test the validity of the assumptions:

  1. Residuals against the explanatory variables in the model, as illustrated above. The residuals should have no relation to these variables (look for possible non-linear relations) and the spread of the residuals should be the same over the whole range.
  2. Residuals against explanatory variables not in the model. Any relation of the residuals to these variables would suggest considering these variables for inclusion in the model.
  3. Residuals against the fitted values, \hat{\mathbf y}.
  4. A time series plot of the residuals, that is, plotting the residuals as a function of time.
  5. Residuals against the preceding residual.
  6. A normal probability plot of the residuals to test normality. The points should lie along a straight line.

In all but the last plot, the points should show no noticeable pattern.

Assessing goodness of fit

  1. The coefficient of determination gives what fraction of the observed variance of the response variable can be explained by the given variables.
  2. Examine the observational and prediction confidence intervals. In most contexts, the smaller they are the better.

Other procedures

Generalized least squares

Generalized least squares, which includes weighted least squares as a special case, can be used when the observational errors have unequal variance or serial correlation.
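For the weighted least squares special case, a straight-line fit only needs each term in the normal equations multiplied by a weight. The sketch below assumes independent errors with known standard deviations sigma_i and uses weights w_i = 1/sigma_i^2; the function name, the data and the sigma values are hypothetical, and serially correlated errors (the general case) are not handled.

```python
def weighted_line_fit(x, y, w):
    """Fit y = alpha + beta*x by minimizing sum(w_i * (y_i - alpha - beta*x_i)^2)."""
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swy = sum(wi * yi for wi, yi in zip(w, y))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = sw * swxx - swx ** 2
    beta = (sw * swxy - swx * swy) / delta
    alpha = (swxx * swy - swx * swxy) / delta
    return alpha, beta


# Hypothetical data; the last two points are assumed to be noisier.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
sigmas = [0.5, 0.5, 1.0, 2.0, 2.0]
weights = [1.0 / s ** 2 for s in sigmas]

alpha_w, beta_w = weighted_line_fit(x, y, weights)
alpha_ols, beta_ols = weighted_line_fit(x, y, [1.0] * len(x))  # equal weights
print((alpha_w, beta_w), (alpha_ols, beta_ols))
```

With equal weights the formulas reduce to the ordinary least squares expressions, which gives a quick consistency check on the implementation.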

Errors-in-variables model

An errors-in-variables model, or total least squares, can be used when the independent variables are subject to error.

Generalized linear model

Generalized linear models are used when the distribution function of the errors is not a normal distribution. Examples include the exponential, gamma, inverse Gaussian, Poisson, binomial and multinomial distributions.

Robust regression

A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is generally more computationally intensive than linear regression and is somewhat more difficult to implement as well. Least squares estimates are not very sensitive to violations of the assumption that the errors are normal, but this no longer holds when the variance or mean of the error distribution is not bounded, or when no analyst who can identify outliers is available.

Among Stata users, robust regression is frequently taken to mean linear regression with Huber-White standard error estimates, owing to the naming conventions for regression commands. This procedure relaxes the assumption of homoscedasticity for variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion: Stata users sometimes believe that linear regression is a robust method when this option is used, although it is not robust in the sense of outlier-resistance.

Instrumental variables and related methods

The assumption that the error term in the linear model can be treated as uncorrelated with the independent variables will frequently be untenable, as omitted-variables bias, "reverse" causation, and errors-in-variables problems can generate such a correlation. Instrumental variable and other methods can be used in such cases.

Applications of linear regression

Linear regression is widely used in biological, behavioral and social sciences to describe relationships between variables. It ranks as one of the most important tools used in these disciplines.

The trend line

A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over the period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.

Trend lines are sometimes used in business analytics to show changes in data over time. This has the advantage of being simple. Trend lines are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.

Epidemiology

As one example, early evidence relating tobacco smoking to mortality and morbidity came from studies employing regression. Researchers usually include several variables in their regression analysis in an effort to remove factors that might produce spurious correlations. For the cigarette smoking example, researchers might include socio-economic status in addition to smoking to ensure that any observed effect of smoking on mortality is not due to some effect of education or income. However, it is never possible to include all possible confounding variables in a study employing regression. For the smoking example, a hypothetical gene might increase mortality and also cause people to smoke more. For this reason, randomized controlled trials are often able to generate more compelling evidence of causal relationships than correlational analysis using linear regression. When controlled experiments are not feasible, variants of regression analysis such as instrumental variables and other methods may be used to attempt to estimate causal relationships from observational data.

Finance

The capital asset pricing model uses linear regression as well as the concept of Beta for analyzing and quantifying the systematic risk of an investment. This comes directly from the Beta coefficient of the linear regression model that relates the return on the investment to the return on all risky assets.

Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed to provide the volatility of an investment relative to the volatility of the market as a whole. This would require that both of these variables be treated in the same way when estimating the slope, whereas regression treats all variability as being in the investment returns variable; that is, it only considers residuals in the dependent variable.
