LINEAR REGRESSION BASIC TERMS #
A regression analysis measures how changes in one variable, called the dependent or explained variable, can be explained by changes in one or more other variables, called the independent or explanatory variables.
A scatter plot is a visual representation of the relationship between the dependent variable and a given independent variable.
POPULATION REGRESSION FUNCTION
Assuming that the 30 observations represent the population of hedge funds in the same class, their relationship can provide a population regression function. Such a function consists of parameters called regression coefficients. The regression equation includes an intercept term and one slope coefficient for each independent variable.
THE ERROR TERM
There is a dispersion of Y-values around each conditional expected value. The difference between each Y-value and its corresponding conditional expectation is the error term (or noise component), denoted εi.
Sample Regression Function: The sample regression function is an equation that represents the relationship between the Y and X variable(s) based only on the information in a sample of the population.
Linear Regression Equation:
Y = a + bX + e
b is the regression slope coefficient.
a is the intercept, the value of Y when X is zero.
Three conditions must hold to use linear regression:
- The relationship between Y and X is linear.
- The error terms are additive (the variance of the error term is independent of the observed data).
- All X variables are observable.
If the relationship between X and Y is nonlinear, the data are transformed first (e.g., by taking logs) and the transformed data are then used in the linear equation, as in the sketch below.
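A minimal sketch of this transformation step, assuming hypothetical exponential data and using NumPy (the model parameters below are illustrative assumptions, not from the text):

```python
import numpy as np

# Hypothetical data with a nonlinear (exponential) relationship: Y = exp(a + bX)
rng = np.random.default_rng(42)
x = np.linspace(1, 10, 30)
y = np.exp(0.5 + 0.3 * x) * rng.lognormal(sigma=0.1, size=x.size)

# Taking logs makes the relationship linear: log(Y) = a + bX + e
log_y = np.log(y)
b, a = np.polyfit(x, log_y, 1)  # returns slope first, then intercept
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
```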
KEY ASSUMPTIONS OF LINEAR REGRESSION #
- The expected value of the error term, conditional on the independent variable, is zero.
- All (X, Y) observations are independent and identically distributed (i.i.d.).
- It is unlikely that large outliers will be observed in the data. Large outliers have the potential to create misleading regression results.
- A linear relationship exists between the dependent and independent variable.
- The model is correctly specified in that it includes the appropriate independent variables and does not omit relevant variables.
- The independent variable is uncorrelated with the error terms.
- The variance of εi is constant for all X.
- No serial correlation of the error terms exists.
- The error term is normally distributed.
ORDINARY LEAST SQUARES REGRESSION #
Ordinary least squares (OLS) estimation is a process that estimates the population parameters βi with corresponding values for bi that minimize the sum of squared residuals. The formulas for the coefficients are:
b1 = Cov(X, Y) / Var(X)
b0 = ȳ − b1x̄
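A minimal sketch of these two formulas, assuming hypothetical sample data and using NumPy:

```python
import numpy as np

# Hypothetical sample: 30 (X, Y) observations from an assumed linear model
rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)

# OLS slope: b1 = Cov(X, Y) / Var(X)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
# OLS intercept: b0 = mean(Y) - b1 * mean(X)
b0 = y.mean() - b1 * x.mean()

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
# Cross-check against NumPy's least-squares fit: returns [b1, b0]
print(np.polyfit(x, y, 1))
```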
PROPERTIES OF OLS ESTIMATORS #
BENEFITS:
- Statistical software packages make it easy for users to apply OLS estimators.
- OLS estimated coefficients are unbiased, consistent & efficient.
PROPERTIES:
- OLS estimators have their own probability distribution.
- An estimator is unbiased if the mean of its sampling distribution equals the true population parameter.
- Given the central limit theorem, for large sample sizes it is reasonable to assume that the sampling distribution will approach the normal distribution. Because the variance of the estimator also shrinks as the sample size grows, the estimator is a consistent estimator as well (illustrated in the sketch below).
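A small simulation can illustrate these sampling-distribution properties; the data-generating model below is a hypothetical assumption for illustration only:

```python
import numpy as np

# Simulate the sampling distribution of the OLS slope estimator b1.
# Assumed true model (for illustration): Y = 2 + 1.5X + e, e ~ N(0, 1)
rng = np.random.default_rng(0)
slopes = []
for _ in range(5000):
    x = rng.normal(size=100)
    y = 2.0 + 1.5 * x + rng.normal(size=100)
    b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    slopes.append(b1)

slopes = np.asarray(slopes)
# Unbiasedness: the mean of the estimates sits near the true slope of 1.5
print(f"mean of b1 estimates: {slopes.mean():.3f}, std: {slopes.std():.3f}")
```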
OLS REGRESSION RESULTS #
The sum of squared residuals (SSR), sometimes denoted SSE for sum of squared errors, results from placing a given intercept and slope coefficient into the equation, computing the residuals, squaring them, and summing them. It is represented by Σei².
THE COEFFICIENT OF DETERMINATION
The coefficient of determination, represented by R², is a measure of the “goodness of fit” of the regression. It is interpreted as the percentage of variation in the dependent variable explained by the independent variable. The underlying concept is that the dependent variable has a total sum of squares (TSS) around the sample mean.
Total sum of squares (TSS) = explained sum of squares (ESS) + sum of squared residuals (SSR)
Σ(yi − ȳ)² = Σ(ŷi − ȳ)² + Σ(yi − ŷi)²
(This decomposition is illustrated in the sketch after the notes below.)
- The correlation coefficient indicates the sign of the relationship, whereas the coefficient of determination does not.
- The coefficient of determination can apply to an equation with several independent variables, and it implies explanatory power, while the correlation coefficient only applies to two variables and does not imply causation between them.
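A minimal sketch of the TSS = ESS + SSR decomposition and R², assuming the same hypothetical data and OLS formulas as above:

```python
import numpy as np

# Hypothetical data and fitted line (same assumed model as the OLS sketch)
rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=30)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
ssr = np.sum((y - y_hat) ** 2)         # sum of squared residuals
print(f"TSS = {tss:.3f}, ESS + SSR = {ess + ssr:.3f}")  # these match
print(f"R^2 = ESS/TSS = {ess / tss:.3f}")
```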
THE STANDARD ERROR OF THE REGRESSION
The standard error of the regression (SER) measures the degree of variability of the actual Y-values relative to the estimated Y-values from a regression equation. The SER gauges the “fit” of the regression line. The smaller the standard error, the better the fit.
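For a simple regression with n observations, the SER is computed from the sum of squared residuals:
SER = √( SSR / (n − 2) ) = √( Σei² / (n − 2) )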
DUMMY VARIABLE #
When a variable is binary in nature (it is either on or off), it falls under the category of a dummy variable. Dummy variables are assigned a value of 0 or 1.
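A minimal sketch of constructing a dummy variable, assuming hypothetical fund-type labels:

```python
import numpy as np

# Hypothetical example: code a "fund is a hedge fund" indicator as a 0/1 dummy
fund_types = np.array(["hedge", "mutual", "hedge", "mutual", "hedge"])
is_hedge = (fund_types == "hedge").astype(int)  # 1 if hedge fund, else 0
print(is_hedge)  # [1 0 1 0 1]
```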
COEFFICIENT OF DETERMINATION R² #
The R² of a regression model captures the fit of the model. For a single-regressor model, the correlation coefficient is the square root of R².
R² = ESS / TSS = 1 − SSR / TSS
REGRESSION COEFFICIENT HYPOTHESIS TESTING #
A t-test may also be used to test the hypothesis that the true slope coefficient, B1, is equal to some hypothesized value. Letting b1 be the point estimate for B1, the appropriate test statistic with n − 2 degrees of freedom is:
t = (b1 − B1) / sb1
The decision rule for tests of significance for regression coefficients is:
Reject H0 if t > +tcritical or t < –tcritical
Rejection of the null means that the slope coefficient is different from the hypothesized value of B1.
Hypothesis testing for a regression coefficient may use the confidence interval for the coefficient being tested. The null hypothesis is H0: B1 = 0 and the alternative hypothesis is HA: B1 ≠ 0.
If the confidence interval at the desired level of significance does not include zero, the null is rejected and the coefficient is said to be statistically different from zero.
The confidence interval for the regression coefficient, B1, is calculated as: b1 ± (tc × sb1)
tc is the critical two-tailed t-value for the selected confidence level, with degrees of freedom equal to n − 2.
The standard error of the regression coefficient is denoted sb1.
P Value: The p-value is the smallest level of significance for which the null hypothesis can be rejected.
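A minimal sketch of the slope t-test, p-value, and confidence interval, assuming hypothetical data and using SciPy's t-distribution:

```python
import numpy as np
from scipy import stats

# Hypothetical data (same assumed model as the earlier sketches)
rng = np.random.default_rng(7)
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)

# OLS fit and residuals
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

# Standard error of the slope: sb1 = SER / sqrt(sum((x - x_bar)^2))
ser = np.sqrt(np.sum(resid ** 2) / (n - 2))
s_b1 = ser / np.sqrt(np.sum((x - x.mean()) ** 2))

# Test H0: B1 = 0 with n - 2 degrees of freedom
t_stat = (b1 - 0) / s_b1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-tailed p-value

# 95% confidence interval: b1 +/- tc * sb1
t_c = stats.t.ppf(0.975, df=n - 2)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
print(f"95% CI: [{b1 - t_c * s_b1:.3f}, {b1 + t_c * s_b1:.3f}]")
```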
PREDICTED VALUES
Predicted values are values of the dependent variable based on the estimated regression coefficients and a prediction about the value of the independent variable.
For a simple regression, the predicted value of Y is:
Ŷ = b0 + b1Xp
Where: Ŷ = predicted value of the dependent variable.
Xp = forecasted value of the independent variable.
CONFIDENCE INTERVALS FOR PREDICTED VALUES
The equation for the confidence interval for a predicted value of Y is: Ŷ ± (tc × sf), which gives the interval Ŷ − (tc × sf) < Y < Ŷ + (tc × sf)
Where: tc = two-tailed critical t-value at the desired level of significance with df = n − 2.
sf = standard error of the forecast.
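A minimal sketch of a forecast confidence interval, assuming hypothetical data and the standard simple-regression formula for the forecast standard error (an assumption not spelled out in the text above):

```python
import numpy as np
from scipy import stats

# Hypothetical data and fitted line (same assumed model as earlier sketches)
rng = np.random.default_rng(7)
n = 30
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=0.5, size=n)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)
ser = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Predicted value and forecast standard error at a chosen Xp
# (standard formula: sf^2 = SER^2 * [1 + 1/n + (Xp - x_bar)^2 / Sxx])
x_p = 1.0
y_hat = b0 + b1 * x_p
sxx = np.sum((x - x.mean()) ** 2)
s_f = ser * np.sqrt(1 + 1 / n + (x_p - x.mean()) ** 2 / sxx)

# 95% confidence interval for the predicted Y
t_c = stats.t.ppf(0.975, df=n - 2)
print(f"Y_hat = {y_hat:.3f}, 95% CI: [{y_hat - t_c * s_f:.3f}, {y_hat + t_c * s_f:.3f}]")
```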