MULTIPLE REGRESSION BASICS #
Multiple regression is regression analysis with more than one independent variable. It is used to quantify the influence of two or more independent variables on a dependent variable.
The general multiple linear regression model is:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi
OLS ESTIMATOR IN A MULTIPLE REGRESSION
The multiple regression methodology estimates the intercept and slope coefficients such that the sum of the squared error terms is minimized. The estimators of these coefficients are known as ordinary least squares (OLS) estimators.
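A minimal sketch of OLS estimation, assuming hypothetical simulated data and using NumPy's least-squares solver (not any particular statistics package):

```python
# Minimal OLS sketch: estimate the intercept and slope coefficients that
# minimize the sum of squared errors, on hypothetical simulated data.
import numpy as np

rng = np.random.default_rng(seed=42)
n = 100                                    # observations
X = rng.normal(size=(n, 2))                # two independent variables X1, X2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

X_design = np.column_stack([np.ones(n), X])            # add intercept column
b_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)   # OLS estimates [b0, b1, b2]
print("estimated coefficients:", b_hat)
```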
INTERPRETING THE SLOPE COEFFICIENTS IN A MULTIPLE REGRESSION
The intercept term is the value of the dependent variable when the independent variables are all equal to 0.
Each slope coefficient is the estimated change in the dependent variable for a one-unit change in that independent variable, holding the other independent variables constant. That’s why the slope coefficients in a multiple regression are sometimes called partial slope coefficients.
ASSUMPTIONS OF MULTIPLE REGRESSION #
- A linear relationship exists between the dependent and independent variables.
- The independent variables are not random, and there is no exact linear relation between any two or more independent variables.
- The expected value of the error term, conditional on the independent variables, is zero.
- The variance of the error terms is constant for all observations.
- The error term for one observation is not correlated with that of another observation.
- The error term is normally distributed.
MEASURE OF FIT #
The standard error of the regression (SER) measures the uncertainty about the accuracy of the predicted values of the dependent variable. Formally, SER is the standard deviation of the error terms in the regression; it measures the degree of variability of the actual Y-values relative to the estimated (predicted) Y-values. The SER gauges the "fit" of the regression line: the smaller the standard error, the better the fit.
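As a rough illustration, SER can be computed from the sum of squared residuals; the figures below are hypothetical:

```python
# Sketch: SER = sqrt(SSR / (n - k - 1)), using hypothetical regression output.
import math

SSR = 1.8        # sum of squared residuals (hypothetical)
n, k = 50, 3     # observations and number of independent variables (hypothetical)
SER = math.sqrt(SSR / (n - k - 1))   # standard deviation of the error terms
print(round(SER, 4))
```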
COEFFICIENT OF DETERMINATION, R2
The multiple coefficient of determination, R2, can be used to test the overall effectiveness of the entire set of independent variables in explaining the dependent variable.
R2 is calculated the same way as in simple linear regression.
R2 = (TSS - SSR) / TSS = explained variation / total variation
ADJUSTED R2
R2 almost always increases as independent variables are added to the model, even if the marginal contribution of the new variables is not statistically significant. This problem is often referred to as overestimating the regression.
To overcome the problem of overestimating the impact of additional variables on the explanatory power of a regression model, many researchers recommend adjusting R2 for the number of independent variables. The adjusted R2 value is expressed as:
R2a = 1 - [(n - 1) / (n - k - 1)] × (1 - R2)
where R2a = adjusted R2, n = number of observations, and k = number of independent variables.
R2a is less than or equal to R2, and it can be negative if R2 is low enough.
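A short sketch computing R2 and adjusted R2 from the sums of squares, with hypothetical values for TSS, SSR, n, and k:

```python
# Sketch: R2 and adjusted R2 from hypothetical sums of squares.
TSS, SSR = 12.0, 1.8     # total and residual sums of squares (hypothetical)
n, k = 50, 3             # observations and independent variables (hypothetical)

R2 = (TSS - SSR) / TSS                              # explained / total variation
R2_adj = 1 - ((n - 1) / (n - k - 1)) * (1 - R2)     # penalizes extra regressors
print(round(R2, 4), round(R2_adj, 4))               # R2_adj <= R2
```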
REGRESSION RESULT INTERPRETATION #
INTERPRETING REGRESSION RESULTS
Just as in simple linear regression, the variability of the dependent variable or total sum of squares (TSS) can be broken down into explained sum of squares (ESS) and sum of squared residuals (SSR). The coefficient of determination is:
R2 = ESS / TSS
The coefficient of multiple correlation is simply the square root of R-squared. In the case of a multiple regression, the coefficient of multiple correlation is always positive.
SPECIFICATION BIAS
Specification bias refers to the fact that the slope coefficient and other statistics for a given independent variable in a simple regression are usually different from those for the same variable when it is included in a multiple regression.
HYPOTHESIS TESTING OF COEFFICIENTS #
The t-statistic used to test the significance of the individual coefficients in a multiple regression is calculated using the same formula that is used with simple linear regression:
t = (bj - Bj) / sbj = (estimated regression coefficient - hypothesized value) / (coefficient standard error of bj)
The t-statistic has n-k-1 degrees of freedom.
DETERMINING STATISTICAL SIGNIFICANCE
Test the null hypothesis that the coefficient is zero versus the alternative that it is not:
H0: bj = 0 versus HA: bj ≠ 0
INTERPRETING THE p-VALUES
An alternative method of doing hypothesis testing of the coefficients is to compare the p-value to the significance level:
- If the p-value is less than the significance level, the null hypothesis can be rejected.
- If the p-value is greater than the significance level, the null hypothesis cannot be rejected.
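A sketch of the t-test for one coefficient, using hypothetical values for the estimate and its standard error, and SciPy for the two-tailed p-value:

```python
# Sketch: t-test of H0: bj = 0 for a single slope coefficient (hypothetical values).
from scipy import stats

b_j, b_hyp, se_bj = 0.52, 0.0, 0.18   # estimate, hypothesized value, std. error
n, k = 50, 3                          # observations, independent variables
df = n - k - 1

t_stat = (b_j - b_hyp) / se_bj
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-tailed p-value
print(round(t_stat, 3), round(p_value, 4))   # reject H0 if p_value < significance level
```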
CONFIDENCE INTERVALS FOR A REGRESSION COEFFICIENT
The confidence interval for a regression coefficient in multiple regression is calculated and interpreted the same way as it is in simple linear regression.
estimated regression coefficient ± (critical t-value) × (coefficient standard error)
The critical t-value is a two-tailed value with n - k - 1 degrees of freedom at the chosen significance level (e.g., 5% for a 95% confidence interval), where n is the number of observations and k is the number of independent variables.
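For example, a 95% confidence interval might be computed as follows (hypothetical estimate and standard error):

```python
# Sketch: 95% confidence interval for a regression coefficient (hypothetical values).
from scipy import stats

b_j, se_bj = 0.52, 0.18      # estimate and its standard error (hypothetical)
n, k = 50, 3                 # observations, independent variables
t_crit = stats.t.ppf(0.975, n - k - 1)   # two-tailed 5% critical value
ci = (b_j - t_crit * se_bj, b_j + t_crit * se_bj)
print(ci)
```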
PREDICTING THE DEPENDENT VARIABLES
We can use the regression equation to make predictions about the dependent variable based on forecasted values of the independent variables. The process is similar to forecasting with simple linear regression, only now we need predicted values for more than one independent variable. The predicted value of the dependent variable Y is:
Ŷ = b̂0 + b̂1X̂1 + b̂2X̂2 + … + b̂kX̂k
where the b̂j are the estimated regression coefficients and the X̂j are the forecasted values of the independent variables.
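A quick sketch of the prediction step, assuming hypothetical estimated coefficients and forecasted X values:

```python
# Sketch: predicted Y from forecasted independent variables (hypothetical numbers).
import numpy as np

b_hat = np.array([1.2, 0.8, -0.3])        # [intercept, b1, b2] (hypothetical)
x_forecast = np.array([1.0, 2.5, 4.0])    # [1, X1 forecast, X2 forecast]
y_pred = x_forecast @ b_hat
print(round(float(y_pred), 3))
```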
JOINT HYPOTHESIS TESTING #
A joint hypothesis tests two or more coefficients at the same time. For example, we could develop a null hypothesis for a linear regression model with three independent variables that sets two of these coefficients equal to zero: H0: b1 = 0 and b2 = 0 versus the alternative hypothesis that one of them is not equal to zero. That is, if just one of the equalities in this null hypothesis does not hold, we can reject the entire null hypothesis.
THE F-STATISTIC
The F-statistic is used to test whether at least one of the independent variables explains a significant portion of the variation of the dependent variable.
The F-statistic, which is always a one-tailed test, is calculated as:
F = (ESS / k) / [SSR / (n - k - 1)]
The degrees of freedom for the numerator and denominator are:
df (numerator) = k
df (denominator) = n - k - 1
The decision rule for the F-test is:
Decision rule: reject H0 if F (test-statistic) > Fc (critical value)
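A sketch of the F-test that all slope coefficients are jointly zero, using hypothetical sums of squares and SciPy for the critical value:

```python
# Sketch: F-test of H0: b1 = b2 = ... = bk = 0 (hypothetical values).
from scipy import stats

ESS, SSR = 10.2, 1.8     # explained and residual sums of squares (hypothetical)
n, k = 50, 3             # observations, independent variables

F = (ESS / k) / (SSR / (n - k - 1))
F_crit = stats.f.ppf(0.95, k, n - k - 1)           # one-tailed 5% critical value
print(round(F, 2), round(F_crit, 2), F > F_crit)   # reject H0 if F > F_crit
```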
R2 AND ADJUSTED R2 #
When computing both the R2 and the adjusted R2, there are a few pitfalls to acknowledge, which could lead to invalid conclusions.
- If adding an additional independent variable to the regression improves the R2, that variable is not necessarily statistically significant.
- The R2 measure may be spurious, meaning the independent variables may produce a high R2 even though they do not actually cause the movement in the dependent variable.
- If the R2 is high, we cannot assume that we have found all relevant independent variables. Omitted variables may still exist, which would improve the regression results further.
- The R2 measure does not provide evidence that the most or least appropriate independent variables have been selected. Many factors go into finding the most robust regression model, including omitted variable analysis, economic theory, and the quality of data being used to generate the model.
RESTRICTED VS. UNRESTRICTED LEAST SQUARES MODELS
Restricted least squares models restrict one or more of the coefficients to equal a given value; the R2 of the restricted model is then compared to that of the unrestricted model, in which the coefficients are not restricted. An F-statistic can test whether there is a significant difference between the restricted and unrestricted R2, as sketched below.
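A sketch of that comparison, using the standard F-test on the restricted and unrestricted R2 values (all figures hypothetical, with q denoting the number of restrictions):

```python
# Sketch: F-test comparing restricted vs. unrestricted models (hypothetical values).
from scipy import stats

R2_u, R2_r = 0.82, 0.76   # unrestricted and restricted R2 (hypothetical)
n, k, q = 50, 4, 2        # observations, unrestricted regressors, restrictions

F = ((R2_u - R2_r) / q) / ((1 - R2_u) / (n - k - 1))
F_crit = stats.f.ppf(0.95, q, n - k - 1)   # one-tailed 5% critical value
print(round(F, 2), F > F_crit)             # reject the restrictions if F > F_crit
```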