In statistics, linear regression is a form of regression analysis in which the relationship between one or more independent variables and another variable, called the dependent variable, is modeled by a least squares function, called the linear regression equation. This function is a linear combination of one or more model parameters, called regression coefficients. A linear regression equation with one independent variable represents a straight line. The fitted results are then subject to statistical analysis.
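In general there are $m$ parameters to be determined, the components of the vector $\beta$. The model is a linear combination of these parameters, $y_i = \sum_{j} X_{ij}\beta_j + \varepsilon_i$, where the sum runs over the $m$ parameters, $X_{ij}$ is the value of the $j$-th regressor for observation $i$, and $\varepsilon_i$ is the error term.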
In simple linear regression the data model represents a straight line and can be written as $y_i = \alpha + \beta x_i + \varepsilon_i$.
A linear regression model need not be a linear function of the independent variable: linear in this context means that the conditional mean of $y$ is linear in the parameters $\beta$. For example, the model $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i$ is linear in the parameters $\beta_0$, $\beta_1$ and $\beta_2$, but it is not linear in $x$, as it contains $x_i^2$, a nonlinear function of $x_i$. An illustration of this model is shown in the example below.
If there is any linear dependence amongst the columns of X, then the parameters β cannot be estimated by least squares unless β is constrained, as, for example, by requiring the sum of some of its components to be 0. However, some linear combinations of the components of β may still be uniquely estimable in such cases. For example, the model
Classical assumptions for linear regression are that the sample is selected at random from the population of interest, that the dependent variable is continuous on the real line, and that the error terms follow identical and independent normal distributions, that is, that the errors are i.i.d. and Gaussian. These assumptions imply that the error term does not statistically depend on the values of the independent variables, that is, that $\varepsilon$ is statistically independent of the predictor variables X. This article adopts these assumptions unless otherwise stated. In more advanced treatments all of these assumptions may be relaxed. In particular, the assumption that the error terms are normally distributed matters little unless the sample is very small, because central limit theorems imply that, so long as the error terms have finite variance and are not too strongly correlated, the parameter estimates will be approximately normally distributed even when the underlying errors are not.
Under these assumptions, an equivalent formulation of simple linear regression that explicitly shows it as a model of conditional expectation is $\mathrm{E}(y \mid x) = \alpha + \beta x$.
The conditional distribution of $y$ given $x$ is a linear transformation of the distribution of the error term. Note that this expression follows from the assumption that the mean of $\varepsilon$ is zero conditional on $x$.
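In matrix notation the observations are collected in the column vector $\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}$.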
The first objective of regression analysis is to best-fit the data by adjusting the parameters of the model. Of the different criteria that can be used to define what constitutes a best fit, the least squares criterion is among the most widely used. The objective function, $S$, is defined as the sum of squared residuals $r_i$: $S = \sum_{i=1}^{n} r_i^2$,
where each residual is the difference between the observed value and the value calculated by the model: $r_i = y_i - \sum_{j} X_{ij}\beta_j$.
From the theory of linear least squares, the parameter estimators are found by solving the normal equations $(X^T X)\hat\beta = X^T y$.
Thus, when the matrix $X^T X$ is non-singular, $\hat\beta = (X^T X)^{-1} X^T y$.
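The step of solving the normal equations can be sketched in a few lines of plain Python. This is a minimal illustration, not a production solver: the function name `solve_normal_equations`, the toy data and the singularity threshold `1e-12` are assumptions made for the example, and Gaussian elimination is only one of several ways to solve the system.

```python
def solve_normal_equations(X, y):
    """Return beta_hat solving (X^T X) beta = X^T y by Gaussian elimination."""
    n, m = len(X), len(X[0])
    # Build the m-by-m matrix X^T X and the m-vector X^T y.
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(m)]
         for j in range(m)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(m)]
    # Forward elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        # Heuristic threshold (an assumption of this sketch) for detecting
        # linearly dependent columns of X.
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular: columns of X are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(A[r][c] * beta[c] for c in range(r + 1, m))
        beta[r] = (b[r] - s) / A[r][r]
    return beta


# Toy data lying exactly on y = 1 + 2x; the first column of X is the constant.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_hat = solve_normal_equations(X, y)
print(beta_hat)  # close to [1.0, 2.0]
```

When two columns of X are identical, the function raises an error instead of returning estimates, which mirrors the non-singularity condition stated above.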
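Denote by $\sigma^2$ the variance of the error term (recall we assume that $\varepsilon_i \sim N(0, \sigma^2)$ for every $i = 1, \ldots, n$). An unbiased estimate of $\sigma^2$ is given by $\hat\sigma^2 = \frac{S}{n - m}$, where $S$ is evaluated at the least squares estimates.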
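The solution to the normal equations can be written as $\hat\beta = (X^T X)^{-1} X^T y$, so the estimators are linear combinations of the observations. Under the normality assumption, $\hat\beta \sim N\left(\beta, \sigma^2 (X^T X)^{-1}\right)$,
where $N$ denotes the multivariate normal distribution.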
The standard error of a parameter estimator is given by $\hat\sigma_j = \sqrt{\frac{S}{n - m}\left[(X^T X)^{-1}\right]_{jj}}$.
The residuals can be expressed as $\hat{r} = y - X\hat\beta = y - X (X^T X)^{-1} X^T y$.
Given a value of the independent variable, $x_d$, the predicted response is calculated as $y_d = \sum_{j} X_{dj}\hat\beta_j$, where $X_{dj}$ is the $j$-th regressor evaluated at $x_d$.
The predicted response confidence intervals for the data are given by:
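In the case that the formula to be fitted is a straight line, $y = \alpha + \beta x$, the normal equations are
$n\,\alpha + \left(\sum x_i\right)\beta = \sum y_i$
$\left(\sum x_i\right)\alpha + \left(\sum x_i^2\right)\beta = \sum x_i y_i$
where all summations are from i = 1 to i = n. In this case, $\beta$ is called the regression coefficient.
Thence, by Cramer's rule,
$\hat\beta = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\Delta}, \qquad \hat\alpha = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{\Delta}$
where
$\Delta = n \sum x_i^2 - \left(\sum x_i\right)^2$
The covariance matrix of $(\hat\alpha, \hat\beta)$ is
$\frac{\sigma^2}{\Delta}\begin{pmatrix} \sum x_i^2 & -\sum x_i \\ -\sum x_i & n \end{pmatrix}$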
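For the straight-line case, the Cramer's rule solution of the two normal equations reduces to a handful of sums. The following is a minimal sketch under that reading; the function name `fit_line` and the five data points are invented for illustration, and the returned matrix is the covariance matrix without its factor $\sigma^2$.

```python
def fit_line(xs, ys):
    """Least squares fit of y = alpha + beta * x via Cramer's rule."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of the 2-by-2 normal equation matrix.
    delta = n * sxx - sx * sx
    if delta == 0:
        raise ValueError("all x values are equal; the slope is not identifiable")
    beta = (n * sxy - sx * sy) / delta
    alpha = (sxx * sy - sx * sxy) / delta
    # Covariance matrix of (alpha_hat, beta_hat), up to the factor sigma^2.
    unscaled_cov = [[sxx / delta, -sx / delta],
                    [-sx / delta, n / delta]]
    return alpha, beta, unscaled_cov


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]
alpha_hat, beta_hat, unscaled_cov = fit_line(xs, ys)
print(alpha_hat, beta_hat)  # roughly 0.22 and 1.96
```

Multiplying `unscaled_cov` by an estimate of the error variance gives estimated variances of the intercept and slope on its diagonal.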
The mean response confidence interval is given by
The predicted response confidence interval is given by
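The "total (corrected) sum of squares" is $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2$,
where $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
("corrected" means $\bar{y}$ has been subtracted from each y-value).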
The total sum of squares is partitioned as the sum of the "regression sum of squares" SSReg (or RSS, also called the "explained sum of squares") and the "error sum of squares" SSE, which is the sum of squares of residuals.
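The regression sum of squares is $\mathrm{SSReg} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = \hat\beta^T X^T y - \frac{1}{n}\left(y^T u u^T y\right)$,
where $u$ is an n-by-1 vector in which each element is 1. Note that $y^T u = u^T y = \sum_i y_i$
and $\frac{1}{n} y^T u u^T y = \frac{1}{n}\left(\sum_i y_i\right)^2$.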
The error (or "unexplained") sum of squares SSE, which is the sum of squares of the residuals, is given by $\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = y^T y - \hat\beta^T X^T y$.
The total sum of squares SST is $\mathrm{SST} = y^T y - \frac{1}{n}\left(\sum_i y_i\right)^2 = \mathrm{SSReg} + \mathrm{SSE}$.
The coefficient of determination, $R^2$, is then given as $R^2 = \frac{\mathrm{SSReg}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}$.
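The partition of the total sum of squares can be checked numerically. The sketch below fits a straight line to five made-up points (the data and the helper name `fit_line` are assumptions for illustration) and then computes SST, SSReg, SSE and $R^2$ directly from their definitions as sums of squares.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = alpha + beta * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    delta = n * sxx - sx * sx
    beta = (n * sxy - sx * sy) / delta
    alpha = (sxx * sy - sx * sxy) / delta
    return alpha, beta


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.2, 7.9, 10.1]

alpha_hat, beta_hat = fit_line(xs, ys)
y_bar = sum(ys) / len(ys)
fitted = [alpha_hat + beta_hat * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)                # total (corrected)
ssreg = sum((f - y_bar) ** 2 for f in fitted)          # explained by the fit
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))    # residual
r_squared = ssreg / sst

print(sst, ssreg + sse)  # the two values agree up to rounding error
print(r_squared)
```

The identity SST = SSReg + SSE holds here because the model contains a constant term; without one, the cross term between fitted deviations and residuals need not vanish.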
If the errors are independent and normally distributed with expected value 0 and they all have the same variance, then under the null hypothesis that all of the elements of $\beta$ except the constant are 0, the statistic $F = \frac{\mathrm{SSReg}/(m-1)}{\mathrm{SSE}/(n-m)}$
follows an F-distribution with (m-1) and (n-m) degrees of freedom. If that statistic is too large, then one rejects the null hypothesis. How large is too large depends on the level of the test, which is the tolerated probability of type I error; see statistical significance.
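As a small numerical illustration of the test statistic, the sketch below computes the ratio of the two mean squares. The function name `f_statistic` and the input values are hypothetical; the Python standard library has no F-distribution, so comparing the result with a critical value is left to tables or a statistics package.

```python
def f_statistic(ssreg, sse, n, m):
    """Ratio of regression and error mean squares, with (m - 1, n - m) d.o.f."""
    if m < 2 or n <= m:
        raise ValueError("need m >= 2 parameters and n > m observations")
    return (ssreg / (m - 1)) / (sse / (n - m))


# Hypothetical sums of squares from a straight-line fit (m = 2) to n = 5 points.
f_value = f_statistic(ssreg=38.416, sse=0.044, n=5, m=2)
print(f_value)
```

A value this large relative to typical critical values for (1, 3) degrees of freedom would lead to rejecting the null hypothesis at conventional levels.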
| Height (m) | 1.47 | 1.50 | 1.52 | 1.55 | 1.57 | 1.60 | 1.63 | 1.65 | 1.68 | 1.70 | 1.73 | 1.75 | 1.78 | 1.80 | 1.83 |
| Weight (kg) | 52.21 | 53.12 | 54.48 | 55.84 | 57.20 | 58.57 | 59.93 | 61.29 | 63.11 | 64.47 | 66.28 | 68.10 | 69.92 | 72.19 | 74.46 |
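The values of the parameters are found by solving the normal equations $(X^T X)\hat\beta = X^T y$.
With columns 1, $x$ and $x^2$ in $X$, the entries of $X^T X$ are sums of powers of the heights and the entries of $X^T y$ are sums of products of those powers with the weights. Thus, the normal equations are
$\begin{bmatrix} 15 & 24.76 & 41.05 \\ 24.76 & 41.05 & 68.37 \\ 41.05 & 68.37 & 114.35 \end{bmatrix} \begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \hat\beta_2 \end{bmatrix} = \begin{bmatrix} 931 \\ 1548 \\ 2586 \end{bmatrix}$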
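The normal equations for the height and weight table can be rebuilt from the data. The sketch below assumes the quadratic model with columns 1, $x$ and $x^2$ (consistent with the three coefficients in the system above); the helper name `solve` and the use of Gaussian elimination are choices made for this illustration only.

```python
heights = [1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
           1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83]
weights = [52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
           63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46]

# Design matrix with columns 1, x and x^2 (assumed quadratic model).
X = [[1.0, h, h * h] for h in heights]
m = 3

# Normal equation matrix X^T X and right-hand side X^T y.
A = [[sum(row[j] * row[k] for row in X) for k in range(m)] for j in range(m)]
b = [sum(row[j] * w for row, w in zip(X, weights)) for j in range(m)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented copy of the system
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


beta_hat = solve(A, b)
fitted = [sum(bj * xj for bj, xj in zip(beta_hat, row)) for row in X]
residuals = [w - f for w, f in zip(weights, fitted)]

print([[round(a, 2) for a in row] for row in A])
print([round(v) for v in b])  # [931, 1548, 2586], the right-hand side in the text
print(beta_hat)
```

Because the heights span a narrow range, the columns 1, $x$ and $x^2$ are nearly collinear and the system is poorly conditioned; centring the heights before squaring is a common remedy, though it changes the meaning of the individual coefficients.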
The calculated values are given by
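The confidence intervals are computed using $\hat\beta_j \pm t_{\alpha/2,\,n-m}\,\hat\sigma_j$, where $\hat\sigma_j$ is the standard error of $\hat\beta_j$,
with $\alpha$ = 5% and $t_{\alpha/2,\,n-m}$ = 2.2 (here $n - m = 15 - 3 = 12$ degrees of freedom). Therefore, we can say that the 95% confidence intervals are: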
There should not be any noticeable pattern to the data in all but the last plot.
A generalized linear model is used when the distribution of the errors is not a normal distribution. Examples include the exponential, gamma, inverse Gaussian, Poisson, binomial and multinomial distributions.
A host of alternative approaches to the computation of regression parameters are included in the category known as robust regression. One technique minimizes the mean absolute error, or some other function of the residuals, instead of the mean squared error as in linear regression. Robust regression is much more computationally intensive than linear regression and is somewhat more difficult to implement as well. While least squares estimates are not very sensitive to violations of the assumption that the errors are normal, this is not true when the variance or mean of the error distribution is not bounded, or when no analyst who can identify outliers is available.
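To make the contrast with least squares concrete, the sketch below fits a straight line by approximately minimizing the sum of absolute residuals. It uses iteratively reweighted least squares, which is a technique chosen here for illustration and not one named in the text; the function names, the iteration count and the floor `eps` on the residuals are all assumptions of this example.

```python
def weighted_line_fit(xs, ys, ws):
    """Weighted least squares fit of y = a + b * x."""
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))
    delta = sw * swxx - swx * swx
    b = (sw * swxy - swx * swy) / delta
    a = (swy - b * swx) / sw
    return a, b


def lad_line_fit(xs, ys, iterations=100, eps=1e-8):
    """Approximate least absolute deviations fit by reweighting.

    Each pass weights a point by 1 / |residual|, so that the weighted
    squared residual behaves like an absolute residual. The floor eps
    avoids division by zero for points that are fitted exactly.
    """
    ws = [1.0] * len(xs)  # the first pass is ordinary least squares
    a = b = 0.0
    for _ in range(iterations):
        a, b = weighted_line_fit(xs, ys, ws)
        ws = [1.0 / max(abs(y - (a + b * x)), eps) for x, y in zip(xs, ys)]
    return a, b


# Hypothetical data on the line y = 1 + 2x, with the last point a gross outlier.
xs = [float(i) for i in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[-1] = 100.0

ols_a, ols_b = weighted_line_fit(xs, ys, [1.0] * len(xs))
lad_a, lad_b = lad_line_fit(xs, ys)
print("OLS slope:", ols_b)  # pulled well above 2 by the outlier
print("LAD slope:", lad_b)  # close to 2
```

The repeated refitting also shows where the extra computational cost mentioned above comes from: each robust fit here requires many weighted least squares fits.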
Among Stata users, robust regression is frequently taken to mean linear regression with Huber-White standard error estimates, owing to the naming conventions for regression commands. This procedure relaxes the assumption of homoscedasticity for the variance estimates only; the coefficient estimates are still ordinary least squares (OLS) estimates. This occasionally leads to confusion: Stata users sometimes believe that linear regression is a robust method when this option is used, although it is not robust in the sense of outlier resistance.
A trend line represents a trend, the long-term movement in time series data after other components have been accounted for. It tells whether a particular data set (say GDP, oil prices or stock prices) has increased or decreased over a period of time. A trend line could simply be drawn by eye through a set of data points, but more properly its position and slope are calculated using statistical techniques like linear regression. Trend lines typically are straight lines, although some variations use higher degree polynomials depending on the degree of curvature desired in the line.
Trend lines are sometimes used in business analytics to show changes in data over time. They are often used to argue that a particular action or event (such as training, or an advertising campaign) caused observed changes at a point in time. This is a simple technique, and does not require a control group, experimental design, or a sophisticated analysis technique. However, it suffers from a lack of scientific validity in cases where other potential changes can affect the data.
Regression may not be the appropriate way to estimate beta in finance, given that beta is supposed to provide the volatility of an investment relative to the volatility of the market as a whole. This would require that both these variables be treated in the same way when estimating the slope, whereas regression treats all variability as being in the investment returns variable, i.e. it only considers residuals in the dependent variable.
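The asymmetry described above can be seen by regressing each variable on the other. In the sketch below the return series are invented for illustration and the function name `ols_slope` is hypothetical; the point is only that the slope of asset on market, the reciprocal of the slope of market on asset, and the plain ratio of standard deviations generally differ.

```python
import math


def ols_slope(xs, ys):
    """Least squares slope of ys regressed on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical periodic returns for a market index and a single asset.
market = [0.01, -0.02, 0.03, 0.00, 0.02]
asset = [0.02, -0.01, 0.02, 0.01, 0.03]

slope_asset_on_market = ols_slope(market, asset)        # the usual "beta"
slope_market_on_asset = ols_slope(asset, market)
inverse_reverse_slope = 1.0 / slope_market_on_asset

# Ratio of standard deviations, which treats both variables symmetrically.
m_bar = sum(market) / len(market)
a_bar = sum(asset) / len(asset)
sd_ratio = math.sqrt(sum((a - a_bar) ** 2 for a in asset)
                     / sum((v - m_bar) ** 2 for v in market))

print(slope_asset_on_market, sd_ratio, inverse_reverse_slope)
```

For positively correlated series the three numbers are ordered as printed, and they coincide only when the correlation is perfect.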