Homoskedasticity refers to the condition that the variance of the error term is constant across all observations i = 1 to n, whatever the values of the independent variables X: Var(εi | Xi) = σ².
Heteroskedasticity means that the dispersion of the error terms varies over the sample. It may take the form of conditional heteroskedasticity, in which the error variance is a function of the independent variables. This form creates significant problems for statistical inference.
Effects of Heteroskedasticity on regression
The coefficient standard errors are usually unreliable estimates.
The coefficient estimates remain consistent and unbiased.
Because the standard errors are unreliable, hypothesis tests based on them are unreliable as well.
A scatterplot of the residuals against one of the independent variables can reveal patterns among the observations, such as a spread that widens as the variable increases. A more formal method is a hypothesis test based on a chi-squared statistic.
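The formal test can be sketched in a few lines. The notes do not name a specific test, so this sketch assumes the Breusch-Pagan form: regress the squared OLS residuals on the independent variables and compare n × R² from that auxiliary regression with a chi-squared critical value. The simulated data, the helper names `ols` and `breusch_pagan_lm`, and the hard-coded 5% critical value for one degree of freedom are all illustrative assumptions.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals; X must already include a constant column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def breusch_pagan_lm(X, y):
    """LM statistic n * R^2 from regressing the squared residuals on X."""
    _, resid = ols(X, y)
    u2 = resid ** 2
    _, aux_resid = ols(X, u2)
    r2 = 1.0 - aux_resid @ aux_resid / np.sum((u2 - u2.mean()) ** 2)
    return len(y) * r2

# Simulated data (an assumption for illustration, not from the notes)
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y_homo = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n)        # constant error variance
y_hetero = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n) * x  # error spread grows with x

CHI2_CRIT_5PCT_1DF = 3.841  # chi-squared critical value, 1 degree of freedom, 5% level
lm_homo = breusch_pagan_lm(X, y_homo)
lm_hetero = breusch_pagan_lm(X, y_hetero)
print("LM statistic, homoskedastic errors:  ", lm_homo)
print("LM statistic, heteroskedastic errors:", lm_hetero)
print("Reject homoskedasticity for the second sample:", lm_hetero > CHI2_CRIT_5PCT_1DF)
```

The degrees of freedom equal the number of independent variables in the auxiliary regression, so the critical value would change for a model with more regressors.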
If conditional heteroskedasticity is detected, White (heteroskedasticity-robust) standard errors can be used in hypothesis testing instead of the standard errors from the OLS estimation procedure.
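As a rough illustration of the correction, the sketch below computes both the conventional OLS standard errors and White standard errors. It assumes the simplest "HC0" sandwich form, (X'X)⁻¹ X' diag(e²) X (X'X)⁻¹, with no small-sample adjustment; the function name `ols_with_white_se` and the simulated data are hypothetical.

```python
import numpy as np

def ols_with_white_se(X, y):
    """Return coefficients, conventional OLS standard errors, and White (HC0) standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # Conventional standard errors assume one common error variance.
    sigma2 = resid @ resid / (n - k)
    se_ols = np.sqrt(np.diag(sigma2 * XtX_inv))
    # White standard errors let every observation carry its own squared residual.
    meat = X.T @ (X * resid[:, None] ** 2)
    se_white = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_ols, se_white

# Simulated heteroskedastic data (an assumption for illustration)
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, n) * 0.1 * x ** 2  # error spread grows with x

beta, se_ols, se_white = ols_with_white_se(X, y)
print("slope estimate:", beta[1])
print("OLS standard error of slope:  ", se_ols[1])
print("White standard error of slope:", se_white[1])
```

The coefficient estimates are identical under both approaches; only the standard errors, and therefore the t-statistics, change.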
Multicollinearity refers to the condition when two or more of the independent variables, or linear combinations of the independent variables, in a multiple regression are highly correlated with each other. This condition distorts the standard error of the regression and the coefficient standard errors, leading to problems when conducting t-tests for statistical significance of parameters.
If one of the independent variables is a perfect linear combination of the other independent variables, then the model is said to exhibit perfect multicollinearity.
Imperfect multicollinearity arises when two or more independent variables are highly correlated, but less than perfectly correlated.
EFFECT OF MULTICOLLINEARITY ON REGRESSION ANALYSIS.
Because multicollinearity inflates the coefficient standard errors and so shrinks the t-statistics, there is a greater probability that we will incorrectly conclude that a variable is not statistically significant (i.e., a Type II error).
The most common sign of multicollinearity is that t-tests indicate that none of the individual coefficients is significantly different from zero, while the R² is high.
High correlation among the independent variables suggests the possibility of multicollinearity, but low correlation among the independent variables does not necessarily indicate multicollinearity is not present.
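The detection symptoms can be reproduced with a small simulation. In this sketch, which uses made-up data and a hypothetical helper `fit_ols`, the same dependent-variable equation is estimated twice: once with two nearly identical regressors and once with two unrelated ones. The R² is high in the collinear case, yet the coefficient standard errors are many times larger.

```python
import numpy as np

def fit_ols(X, y):
    """OLS with conventional standard errors and R^2; X must include a constant column."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    se = np.sqrt(np.diag(resid @ resid / (n - k) * XtX_inv))
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, se, r2

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2_collinear = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1
x2_independent = rng.normal(size=n)            # unrelated to x1

def simulate(x2):
    """Generate y = 1 + x1 + x2 + noise and fit the two-variable regression."""
    y = 1.0 + x1 + x2 + rng.normal(size=n)
    X = np.column_stack([np.ones(n), x1, x2])
    return fit_ols(X, y)

beta_c, se_c, r2_c = simulate(x2_collinear)
beta_i, se_i, r2_i = simulate(x2_independent)

print("pairwise correlation (collinear case):", np.corrcoef(x1, x2_collinear)[0, 1])
print("R^2 (collinear case):", r2_c)
print("t-statistics (collinear case):  ", beta_c[1:] / se_c[1:])
print("t-statistics (independent case):", beta_i[1:] / se_i[1:])
```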
The most common method to correct for multicollinearity is to omit one or more of the correlated independent variables.
There are statistical procedures that may help in this effort, such as stepwise regression, which systematically removes variables from the regression until multicollinearity is minimized.
MODEL MISSPECIFICATION #
Omitted variable bias is present when two conditions are met:
(1) the omitted variable is correlated with at least one of the independent variables included in the model, and
(2) the omitted variable is a determinant of the dependent variable.
Omitting a relevant independent variable in a multiple regression results in regression coefficients that are biased and inconsistent, which means we cannot have confidence in our hypothesis tests of the coefficients or in the predictions of the model.
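A short simulation makes the bias concrete. The data-generating process below is an assumption chosen for illustration: y depends on x1 (coefficient 2) and x2 (coefficient 3), and x2 moves 0.8 units for each unit of x1, so both conditions for omitted variable bias hold. Dropping x2 pushes the estimated coefficient on x1 toward 2 + 3 × 0.8 = 4.4, and a larger sample does not fix it.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)            # condition (1): correlated with x1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)  # condition (2): x2 determines y

X_full = np.column_stack([np.ones(n), x1, x2])  # correctly specified model
X_short = np.column_stack([np.ones(n), x1])     # x2 omitted

b_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
b_short, *_ = np.linalg.lstsq(X_short, y, rcond=None)

print("coefficient on x1, full model: ", b_full[1])
print("coefficient on x1, x2 omitted: ", b_short[1])
```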
BIAS VARIANCE TRADE OFF #
A model with too many variables performs poorly on out-of-sample data because of overfitting. Overfit models have high variance error, while smaller models have high bias error, which shows up as in-sample fitting error (lower R²). There are two ways to deal with this bias-variance tradeoff:
- General-to-specific model: start with a large model and drop, one at a time, the variable with the smallest absolute t-statistic, re-estimating after each removal.
- m-fold cross-validation: divide the sample into m parts, fit the model on m-1 parts, and use the remaining part for out-of-sample validation, repeating so that each part serves once as the validation set.
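The cross-validation procedure can be sketched as follows. The function name `m_fold_cv_mse`, the simulated data, and the choice of m = 5 are assumptions for illustration. The small model uses the one variable that actually drives y; the large model adds 30 pure-noise variables, so it fits better in sample but should validate worse out of sample.

```python
import numpy as np

def m_fold_cv_mse(X, y, m, rng):
    """Average out-of-sample mean squared error over m folds."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, m)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(idx, held_out)  # the other m-1 parts
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        errors.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
    return float(np.mean(errors))

def in_sample_mse(X, y):
    """Mean squared residual when fitting and evaluating on the same data."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((y - X @ beta) ** 2))

rng = np.random.default_rng(2)
n = 60
x = rng.normal(size=n)
noise_vars = rng.normal(size=(n, 30))  # regressors unrelated to y
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
X_large = np.column_stack([X_small, noise_vars])

# The same seed gives both models identical folds.
cv_small = m_fold_cv_mse(X_small, y, 5, np.random.default_rng(3))
cv_large = m_fold_cv_mse(X_large, y, 5, np.random.default_rng(3))

print("in-sample MSE  small / large:", in_sample_mse(X_small, y), in_sample_mse(X_large, y))
print("5-fold CV MSE  small / large:", cv_small, cv_large)
```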
IDENTIFYING OUTLIERS #
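The regression assumption of no outliers is violated when the sample contains outliers. One metric for identifying them is Cook's measure (also known as Cook's distance), which gauges how much the fitted regression changes when a single observation is dropped.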
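A minimal sketch of the calculation, assuming the usual closed form D_i = [e_i² / (k s²)] × h_ii / (1 − h_ii)², where e_i is the residual, h_ii the leverage, k the number of estimated coefficients (including the intercept), and s² the residual variance. The function name `cooks_distance`, the planted outlier, and the rule of thumb that values above 1 deserve attention are assumptions for illustration.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for every observation; X must include a constant column."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    leverage = np.sum((X @ XtX_inv) * X, axis=1)  # diagonal of the hat matrix
    s2 = resid @ resid / (n - k)
    return resid ** 2 / (k * s2) * leverage / (1.0 - leverage) ** 2

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(size=30)

# Plant one observation far from both the other x values and the true line.
x = np.append(x, 20.0)
y = np.append(y, 0.0)

X = np.column_stack([np.ones(len(x)), x])
d = cooks_distance(X, y)
print("most influential observation:", int(np.argmax(d)))
print("its Cook's distance:", d.max())
```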
BLUE ESTIMATORS #
The Gauss-Markov theorem says that if the linear regression model assumptions are true and the regression errors display homoskedasticity, then the OLS estimators have the following properties.
- The OLS estimated coefficients have the minimum variance among all linear unbiased estimators of the coefficients (i.e., they are the most precise).
- The OLS estimated coefficients are linear functions of the observed values of the dependent variable.
- The OLS estimated coefficients are unbiased, which means that in repeated sampling the coefficient estimates will be centered on the true population parameters [i.e., E(b0) = B0 and E(b1) = B1].
- The OLS estimate of the variance of the errors is unbiased.
The acronym for these properties is “BLUE,” which indicates that OLS estimators