Multiple Regression Calculator



The difference between a multiple regression and a simple linear regression is that a multiple regression has more than one independent variable (x). Although the name does not say so, the relationship between the dependent variable (y) and the independent variables is still linear in multiple regression. In general, a multiple regression has a total of $p$ independent variables, where $p$ is greater than one; when $p$ equals one, the model reduces to a simple linear regression. The estimated multiple regression equation is given below.

Estimated Regression Equation
$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p $

As in simple linear regression, the coefficients in multiple regression are found using the least squares method. That is, the coefficients are chosen so that the sum of the squared residuals is minimized. The difference is that in simple linear regression the formulas for the coefficients can be expressed using ordinary algebra, while in multiple regression they require more advanced math, specifically matrix algebra. Because of this, calculating the coefficients of a multiple regression by hand is usually avoided, and the focus is on interpreting the coefficients.
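
To see where the coefficients come from, here is a minimal sketch of the least squares computation via the normal equations, $b = (X^\top X)^{-1} X^\top y$. It assumes NumPy is available, and the data are made up for illustration.

```python
import numpy as np

# Made-up data: five observations, two independent variables
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
X = np.column_stack([
    np.ones(len(y)),            # column of ones for the intercept b0
    [1.0, 2.0, 3.0, 4.0, 5.0],  # x1
    [2.0, 1.0, 4.0, 3.0, 6.0],  # x2
])

# Solve the normal equations X'X b = X'y for the coefficients;
# np.linalg.solve is preferred over forming the inverse explicitly
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # b[0] is the intercept; b[1], b[2] are the slope coefficients
```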

Sum of Squares Relationship
$ \text{SST} = \text{SSR} + \text{SSE} $
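
The identity can be checked numerically. Below is a minimal sketch, again with made-up data and NumPy assumed: SST measures total variability around the mean, SSR the part explained by the regression, and SSE the part left in the residuals.

```python
import numpy as np

# Made-up data and least squares fit, as in the earlier sketch
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
X = np.column_stack([np.ones(5), [1, 2, 3, 4, 5], [2, 1, 4, 3, 6]])
b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b                          # fitted values

SST = np.sum((y - y.mean()) ** 2)      # total sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # sum of squares due to regression
SSE = np.sum((y - y_hat) ** 2)         # sum of squares due to error
print(np.isclose(SST, SSR + SSE))      # True: SST = SSR + SSE
```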

The coefficient of determination, or r-squared, in multiple regression is computed in the same way as in simple linear regression. However, it has a drawback in multiple regression: r-squared never decreases as independent variables are added, even when those variables are irrelevant. To address this, the adjusted coefficient of determination is preferred in multiple regression. The formula for the adjusted r-squared is given below.

Adjusted Coefficient of Determination
$ R_a^2 = 1 - (1 - R^2) \dfrac{n-1}{n-p-1} $
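
Here is a minimal sketch of the formula as a function; the values of r-squared, n, and p are made up for illustration.

```python
def adjusted_r_squared(r2: float, n: int, p: int) -> float:
    """Adjust r-squared for the number of independent variables p."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Illustrative values: r-squared of 0.92 with n = 25 and p = 4
print(adjusted_r_squared(r2=0.92, n=25, p=4))  # ~0.904
```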

The interpretation of the coefficient of determination is the same as in simple linear regression. The first step is to convert it from a decimal to a percentage by multiplying by 100%. It is then the percentage of variability in the dependent variable explained by the estimated regression equation; for example, an adjusted r-squared of 0.904 means the equation explains 90.4% of the variability. In multiple regression, the adjusted r-squared should be used instead of the regular r-squared. The interpretation of a slope coefficient is that it gives the predicted change in the dependent variable corresponding to a one-unit increase in its independent variable, holding the other independent variables constant.

As in simple linear regression, testing for significance in multiple regression involves either the F-test or the t-test. However, while the two tests are equivalent in simple regression, they are different in multiple regression. In multiple regression, the F-test is a simultaneous test of significance for all the independent variables. If the null hypothesis of the F-test is rejected, then at least one of the independent variables is significant. If the null hypothesis is not rejected, there is no evidence that any of the independent variables is significant.

F Test
$ H_0 \colon \beta_1 = \beta_2 = \cdots = \beta_p = 0 $
$ H_a \colon $ One or more of the $\beta_i \neq 0$
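
As a sketch of how the test statistic is formed, the F statistic is MSR divided by MSE, with $p$ and $n - p - 1$ degrees of freedom. The example below assumes NumPy and SciPy and uses made-up data.

```python
import numpy as np
from scipy import stats

# Made-up data: eight observations, two independent variables
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0, 23.0, 27.0, 30.0])
X = np.column_stack([np.ones(8),
                     [1, 2, 3, 4, 5, 6, 7, 8],   # x1
                     [2, 1, 4, 3, 6, 5, 8, 7]])  # x2
n, p = len(y), X.shape[1] - 1

b = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ b
SSR = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

MSR = SSR / p            # mean square due to regression
MSE = SSE / (n - p - 1)  # mean square due to error
F = MSR / MSE
p_value = stats.f.sf(F, p, n - p - 1)  # upper-tail area of the F distribution
print(F, p_value)  # reject H0 if p_value is below the significance level
```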

If the F test passes (i.e., the null hypothesis is rejected) in multiple regression, then we can proceed to the t tests. Once we know that at least one of the independent variables is significant, the t-tests can be used to determine which ones. A t-test is performed on each individual independent variable, and rejecting the null hypothesis in a t-test means that independent variable is significant. So while the two tests of significance are substitutes in simple regression, they complement each other in multiple regression.

t Tests
$ H_0: \beta_i = 0 $
$ H_a: \beta_i \neq 0 $
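
Each t statistic is the coefficient divided by its standard error, $t_i = b_i / s_{b_i}$, with $n - p - 1$ degrees of freedom. Below is a minimal sketch on the same made-up data, again assuming NumPy and SciPy.

```python
import numpy as np
from scipy import stats

# Same made-up data as the F test sketch above
y = np.array([10.0, 12.0, 15.0, 19.0, 24.0, 23.0, 27.0, 30.0])
X = np.column_stack([np.ones(8),
                     [1, 2, 3, 4, 5, 6, 7, 8],
                     [2, 1, 4, 3, 6, 5, 8, 7]])
n, p = len(y), X.shape[1] - 1

XtX_inv = np.linalg.inv(X.T @ X)
b = XtX_inv @ X.T @ y
MSE = np.sum((y - X @ b) ** 2) / (n - p - 1)
s_b = np.sqrt(MSE * np.diag(XtX_inv))  # standard errors of the coefficients

t = b / s_b                                      # one t statistic per coefficient
p_values = 2 * stats.t.sf(np.abs(t), n - p - 1)  # two-tailed p-values
print(t, p_values)  # reject H0: beta_i = 0 when p_values[i] is small
```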

One of the obstacles commonly encountered in multiple regression is categorical (or qualitative) data. Categorical data is data that does not involve numbers, such as gender or country. The problem with using categorical data in regression is that the least squares method requires numerical data to compute the estimated coefficients. This issue is resolved in multiple regression through the use of dummy variables. A dummy variable takes the value one for one category and zero for the other. When there are more than two categories, more than one dummy variable is used.

Dummy Variable
$ x_i = \begin{cases} 1 \text{ if category 1} \\ 0 \text{ if category 2} \end{cases} $
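
As a minimal sketch, here is how dummy variables could be built by hand for a made-up three-category variable. With $k$ categories, $k - 1$ dummy variables are used; the omitted category serves as the baseline.

```python
import numpy as np

# Made-up categorical data with three categories
country = ["US", "UK", "US", "CA", "UK", "CA"]

# "US" is the baseline, so it is coded as zero in both dummy variables
x_uk = np.array([1.0 if c == "UK" else 0.0 for c in country])
x_ca = np.array([1.0 if c == "CA" else 0.0 for c in country])
print(np.column_stack([x_uk, x_ca]))  # two dummy columns for three categories
```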

While a multiple regression can provide great predictive power, oftentimes a simple linear regression is enough. To compute a simple linear regression and the associated statistics, visit the Simple Regression Calculator. The F test and t test in multiple regression are two examples of hypothesis tests. To perform hypothesis tests, visit the Hypothesis Testing Calculator.