Linear Regression


$x=$
$x_p=$
Example 1Example 2

Linear regression is a statistical method for estimating the relationship between two or more variables, assuming a linear relationship between them. There are two goals to linear regression: prediction and explanation. The variable that is being predicted (or explained) is called the dependent variable and is denoted by "y". The variable(s) used to predict (or explain) the dependent variable are called the independent variable(s), or regressor(s), and are denoted by "x"s. There are various kinds of linear regression. The simplest kind, where there is only one independent variable (x), is called "simple linear regression". When more than one independent variable is used (multiple x's), the linear regression is known as "multiple regression". Linear regression can even be modified to allow for non-linear relationships between the variables.

The simple linear regression model describes how the dependent variable (y) is related to the independent variable (x) and an error term (u). The equation for the simple linear regression model is $ y = \beta_0 + \beta_1 x + u $. Ignoring the error term (u) for a moment, we see that we have the equation of a straight line, which means $\beta_0$ is the y-intercept and $\beta_1$ is the slope.
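For instance, with a hypothetical intercept of $\beta_0 = 1$ and slope of $\beta_1 = 2$, the model would read $ y = 1 + 2x + u $: the expected value of y increases by 2 for every one-unit increase in x.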

In simple linear regression, the starting point is the estimated regression equation: $ \hat{y} = b_0 + b_1 x $. It provides a mathematical relationship between the dependent variable (y) and the independent variable (x). Furthermore, it can be used to predict the value of y for a given value of x. There are two things we need to get the estimated regression equation: the slope ($b_1$) and the intercept ($b_0$). The formulas for the slope and intercept are derived from the least squares method: $ \min \sum (y - \hat{y})^2 $. The graph of the estimated regression equation is known as the estimated regression line.
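As a minimal sketch of how the least squares estimates can be computed (the data below are made up purely for illustration):

```python
import numpy as np

# Hypothetical sample data: x is the independent variable, y the dependent variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least squares estimates of the slope (b1) and intercept (b0)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

# Estimated regression equation and fitted values
y_hat = b0 + b1 * x
print(f"y_hat = {b0:.3f} + {b1:.3f} x")
```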

After the estimated regression equation, the second most important aspect of simple linear regression is the coefficient of determination. The coefficient of determination, denoted $r^2$, provides a measure of goodness of fit for the estimated regression equation. Before we can find the $r^2$, we must find the values of three sums of squares: the Sum of Squares Total (SST), the Sum of Squares Regression (SSR) and the Sum of Squares Error (SSE). The relationship between them is given by SST = SSR + SSE, so, given the values of any two of them, the third one can be easily found.

Sum of Squares
Error: $ \text{SSE} = \sum (y-\hat{y})^2 $
Regression: $ \text{SSR} = \sum (\hat{y}-\bar{y})^2 $
Total: $ \text{SST} = \sum (y-\bar{y})^2 $
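A small sketch of these three quantities, assuming observed values y and fitted values y_hat as in the example above:

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for observed values y and fitted values y_hat."""
    sse = np.sum((y - y_hat) ** 2)           # Sum of Squares Error
    ssr = np.sum((y_hat - np.mean(y)) ** 2)  # Sum of Squares Regression
    sst = np.sum((y - np.mean(y)) ** 2)      # Sum of Squares Total
    return sst, ssr, sse                     # SST == SSR + SSE (up to rounding)
```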

Now that we know the sums of squares, we can calculate the coefficient of determination. The $r^2$ is the ratio of the SSR to the SST. It takes a value between zero and one, with zero indicating the worst fit and one indicating a perfect fit. A perfect fit means all the points in a scatter diagram lie on the estimated regression line. When interpreting the $r^2$, the first step is to convert its value to a percentage. Then it can be interpreted as the percentage of the variability in y explained by the estimated regression equation.

Coefficient of Determination
$ r^2 = \dfrac{\text{SSR}}{\text{SST}} $
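Continuing the sketch, the coefficient of determination follows directly from the sums of squares computed above:

```python
def coefficient_of_determination(ssr, sst):
    """r^2 = SSR / SST; multiply by 100 to read it as the percentage of
    the variability in y explained by the estimated regression equation."""
    return ssr / sst
```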

The sample correlation coefficient can be calculated from the coefficient of determination, indicating a close relationship between regression and correlation. Regression can be thought of as a stronger version of correlation: while correlation only tells us the sign and strength of a relationship, regression quantifies the relationship in a way that facilitates prediction. To get the sample correlation coefficient, simply take the square root of the coefficient of determination, giving it the same sign as the slope.
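A one-line sketch of that conversion, assuming the slope estimate b1 and r_squared are already available:

```python
import math

def sample_correlation(r_squared, b1):
    """Square root of r^2, carrying the sign of the estimated slope b1."""
    return math.copysign(math.sqrt(r_squared), b1)
```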

The next step in regression analysis is to test for significance. That is, we want to determine whether there is a statistically significant relationship between x and y. There are two ways of testing for significance, either with a t test or an F test. The first step in both tests is to calculate the Mean Square Error (MSE), which provides an estimate of the variance of the error. The square root of the MSE is called the Standard Error of Estimate and provides an estimate of the standard deviation of the error.

Mean Square Error: $ \text{MSE} = \dfrac{\text{SSE}}{n-2} $
Standard Error of Estimate: $ s = \sqrt{\text{MSE}} $
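A short sketch of both quantities, assuming the SSE and the sample size n are known:

```python
import math

def mse_and_standard_error(sse, n):
    """MSE = SSE / (n - 2); the standard error of estimate is its square root."""
    mse = sse / (n - 2)
    return mse, math.sqrt(mse)
```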

The t test is a hypothesis test about the true value of the slope, denoted $\beta_1$. The test statistic for this hypothesis test is calculated by dividing the estimated slope, $b_1$, by the estimated standard deviation of $b_1$, $s_{b_1}$. The latter is calculated using the formula $ s_{b_1} = \frac{s}{\sqrt{\sum (x-\bar{x})^2}} $. The test statistic is then used to conduct the hypothesis test, using a t distribution with n-2 degrees of freedom. In simple linear regression, the F test amounts to the same hypothesis test as the t test; the only differences are the test statistic and the probability distribution used.

t Test
Hypotheses: $ H_0: \beta_1 = 0 $, $ H_a: \beta_1 \neq 0 $
Test Statistic: $ t = \dfrac{b_1}{s_{b_1}} $
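A sketch of the t test, assuming x, the estimated slope b1, and the standard error of estimate s are available (scipy is used here only to look up the t distribution):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, b1, s):
    """Two-sided t test of H0: beta_1 = 0 against Ha: beta_1 != 0."""
    s_b1 = s / np.sqrt(np.sum((x - np.mean(x)) ** 2))  # estimated std. dev. of b1
    t = b1 / s_b1                                       # test statistic
    p_value = 2 * stats.t.sf(abs(t), df=len(x) - 2)     # t distribution, n - 2 df
    return t, p_value
```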

Confidence intervals and prediction intervals can be constructed around the estimated regression line. In both cases, the intervals are narrowest at the mean of x and get wider the further they move from the mean. The difference between them is that a confidence interval gives a range for the expected value of y, while a prediction interval gives a range for the predicted value of an individual y. Confidence intervals will therefore be narrower than prediction intervals.
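A sketch of both intervals at a single point x_p, using the standard simple linear regression formulas (the 95% level and the variable names are assumptions for illustration):

```python
import numpy as np
from scipy import stats

def interval_estimates(x, y, x_p, alpha=0.05):
    """Confidence interval for the mean of y and prediction interval for an
    individual y at x = x_p."""
    n = len(x)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    s = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - 2))  # standard error of estimate

    y_p = b0 + b1 * x_p                                       # point estimate at x_p
    h = 1 / n + (x_p - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)

    conf = (y_p - t_crit * s * np.sqrt(h), y_p + t_crit * s * np.sqrt(h))
    pred = (y_p - t_crit * s * np.sqrt(1 + h), y_p + t_crit * s * np.sqrt(1 + h))
    return conf, pred  # the prediction interval is always the wider of the two
```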