Part VII: Multiple Regression (MR)
Bivariate Regression Revisited | Types of Correlations | Extending Regression | Accuracy of Prediction | Contributions of Variables | Methods of Variable Entry | Assumptions of MR | The Problem of Shrinkage

The raw score formula for the regression line in simple regression is Y = bx + a, where b is the slope and a is the intercept. The "weights" for this line are selected on the basis of the Least Squares Criterion: the sum of the squared residuals (the squared differences between the actual scores and the prediction line) is at a minimum, and the sum of squares for the regression (the squared differences between the prediction line and the mean of Y) is at a maximum.
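As a minimal sketch of the Least Squares Criterion, the following Python computes b and a for a single predictor and checks that the total sum of squares splits into a regression part and a residual part. The data values and the function names (`fit_line`, `sums_of_squares`) are made up for illustration.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b, a) for the least-squares line Y = b*x + a."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx          # slope
    a = my - b * mx        # intercept: the line passes through (mean x, mean y)
    return b, a

def sums_of_squares(x, y, b, a):
    """Return (SS regression, SS residual, SS total) for the line Y = b*x + a."""
    my = mean(y)
    pred = [b * xi + a for xi in x]
    ss_reg = sum((p - my) ** 2 for p in pred)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, pred))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return ss_reg, ss_res, ss_tot

# Illustrative data only (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, a = fit_line(x, y)
ss_reg, ss_res, ss_tot = sums_of_squares(x, y, b, a)
print(f"b = {b:.2f}, a = {a:.2f}")
print(f"SS regression = {ss_reg:.2f}, SS residual = {ss_res:.2f}, SS total = {ss_tot:.2f}")
```

Any other line through these points, for example b = 0.7 and a = 2.0, gives a larger residual sum of squares than the fitted line.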
Often, you may need to include more than one predictor in order to enhance the prediction of Y. In that case, however, the predictor variables are usually correlated with one another. The problems this can cause in accounting for variance can be diagrammed:
[Diagram: overlapping variance among x1, x2, and y]

The analysis of the various overlaps presents a problem in terms of correlations. For example, the correlation between x1 and y also includes variance in y that is predicted by x2. However, this overlap can be corrected for mathematically. There are three types of correlations involved in prediction and regression:

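Assuming the three types are the zero-order (ordinary Pearson), partial, and semipartial (part) correlations, which is the usual set in this context, the correction for overlap can be sketched as follows. The correlation values are hypothetical and the function names are illustrative.

```python
from math import sqrt

def partial_r(r_y1, r_y2, r_12):
    """x1 with y, after x2 has been removed from BOTH x1 and y."""
    return (r_y1 - r_y2 * r_12) / sqrt((1 - r_y2 ** 2) * (1 - r_12 ** 2))

def semipartial_r(r_y1, r_y2, r_12):
    """x1 with y, after x2 has been removed from x1 only (part correlation)."""
    return (r_y1 - r_y2 * r_12) / sqrt(1 - r_12 ** 2)

# Hypothetical zero-order correlations, for illustration only
r_y1, r_y2, r_12 = 0.50, 0.40, 0.30

zero_order = r_y1
partial = partial_r(r_y1, r_y2, r_12)
semipartial = semipartial_r(r_y1, r_y2, r_12)

print(f"zero-order  r = {zero_order:.3f}")
print(f"partial     r = {partial:.3f}")
print(f"semipartial r = {semipartial:.3f}")
```

With these values both corrected correlations are smaller than the zero-order correlation, because part of what x1 predicts in y is also predicted by x2.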
While bivariate regression uses a regression line as the basis of prediction for Y, multiple regression uses a plane in three-dimensional space (in the two-predictor case). Hence, the formula simply adds a term for each predictor, with each term having its own coefficient: Y = b1x1 + b2x2 + a. Once again, the Least Squares Criterion is used to minimize the error of prediction. In this case, the "weights" are known as unstandardized regression coefficients, but they can be expressed as standardized regression coefficients, which are the weights that apply when all variables are converted to z-scores. A standardized coefficient is obtained by dividing the standard deviation of x by the standard deviation of y and multiplying the result by the unstandardized coefficient. When this is done, the intercept drops out of the regression formula. The standardized weights are closely related to part correlations, in that each reflects a predictor's relationship with Y after its overlap with the other predictors has been removed. In this way, the regression formula accounts for the maximum amount of variance that can be predicted. Overall, the MR coefficient (multiple R) can be interpreted like a Pearson correlation coefficient. In other words, R squared is the proportion of Y variance accounted for by the predictors. The formula for the multiple correlation in the two-predictor case is as follows:
R = sqrt[(r_y1^2 + r_y2^2 - 2 r_y1 r_y2 r_12) / (1 - r_12^2)]
where r_y1 and r_y2 are the correlations of each predictor with Y and r_12 is the correlation between the two predictors.

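A minimal sketch of the two-predictor case, working from the three zero-order correlations: it computes the standardized weights, converts them to unstandardized weights with b = beta * (SD of y / SD of x), and obtains multiple R. All correlations, standard deviations, and means below are hypothetical, and the function names are illustrative.

```python
from math import sqrt

def standardized_weights(r_y1, r_y2, r_12):
    """Beta weights for two predictors, from the three zero-order correlations."""
    denom = 1 - r_12 ** 2
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    return beta1, beta2

def unstandardize(beta, sd_x, sd_y):
    """Convert a standardized weight to a raw-score weight."""
    return beta * sd_y / sd_x

def multiple_r(r_y1, r_y2, r_12):
    """Multiple R: square root of the sum of (beta * zero-order r) terms."""
    beta1, beta2 = standardized_weights(r_y1, r_y2, r_12)
    return sqrt(beta1 * r_y1 + beta2 * r_y2)

# Hypothetical summary statistics, for illustration only
r_y1, r_y2, r_12 = 0.50, 0.40, 0.30
sd_y, sd_1, sd_2 = 10.0, 2.0, 5.0
mean_y, mean_1, mean_2 = 50.0, 8.0, 20.0

beta1, beta2 = standardized_weights(r_y1, r_y2, r_12)
b1 = unstandardize(beta1, sd_1, sd_y)
b2 = unstandardize(beta2, sd_2, sd_y)
a = mean_y - b1 * mean_1 - b2 * mean_2   # the intercept appears only in raw-score form
R = multiple_r(r_y1, r_y2, r_12)

print(f"beta1 = {beta1:.3f}, beta2 = {beta2:.3f}")
print(f"b1 = {b1:.3f}, b2 = {b2:.3f}, a = {a:.3f}")
print(f"R = {R:.3f}, R squared = {R ** 2:.3f}")
```

Because the predictors overlap (r_12 is not zero), each beta weight is smaller than the corresponding zero-order correlation, and R squared is smaller than the sum of the two squared correlations.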
The test of significance for R is as follows:
F = (R^2 / k) / [(1 - R^2) / (N - k - 1)]
where N = number of subjects and k = number of predictors, and F is evaluated with k and N - k - 1 degrees of freedom.
What does significance of R mean in terms of prediction?

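The test can be sketched in Python, assuming the standard F ratio for multiple R with k and N - k - 1 degrees of freedom. The function name and the input values are hypothetical; the resulting F would still need to be compared with a critical value from an F table.

```python
def f_for_multiple_r(r_squared, n, k):
    """Return (F, df1, df2) for testing whether multiple R differs from zero.

    r_squared: squared multiple correlation
    n:         number of subjects
    k:         number of predictors
    """
    df1 = k
    df2 = n - k - 1
    f = (r_squared / df1) / ((1 - r_squared) / df2)
    return f, df1, df2

# Hypothetical values, for illustration only
f, df1, df2 = f_for_multiple_r(r_squared=0.30, n=53, k=2)
print(f"F({df1}, {df2}) = {f:.2f}")
```

Note that for a fixed R squared, adding predictors or reducing the number of subjects lowers F, so the same R can be significant in one study and not in another.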
Once it is determined that the overall set of predictors is significant, it is usually of interest to know which variables account for the most variance in Y. There are three basic indices of the relative contribution of a variable:

Remember how the predictors give redundant information in the prediction of Y? This redundancy raises an important methodological consideration: the order in which variables are entered into the MR equation. For example, a predictor entered late in the equation may contribute very little in terms of prediction because the previous predictors have already accounted for the variance it shares with Y. However, if that predictor had been entered first, it might have accounted for all of that variance, and the others might not have contributed anything above and beyond it. There are two general categories of variable entry methods used with MR:

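The effect of entry order can be illustrated with a small sketch in which two correlated predictors are entered one at a time in a fixed order. The correlations are hypothetical and the function names are illustrative; the two-predictor R squared is computed from the zero-order correlations.

```python
def r_squared_two(r_y1, r_y2, r_12):
    """R squared for two predictors, from the three zero-order correlations."""
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def increments(r_first, r_second, r_12):
    """Variance credited to each predictor when entered in the given order."""
    total = r_squared_two(r_first, r_second, r_12)
    step1 = r_first ** 2      # the first predictor gets all the variance it can predict
    step2 = total - step1     # the second gets only what is left over
    return step1, step2

# Hypothetical correlations with a large overlap between the predictors
r_y1, r_y2, r_12 = 0.60, 0.50, 0.70

x1_first = increments(r_y1, r_y2, r_12)   # (credit to x1, credit to x2)
x2_first = increments(r_y2, r_y1, r_12)   # (credit to x2, credit to x1)

print(f"x1 entered first: x1 = {x1_first[0]:.3f}, x2 adds {x1_first[1]:.3f}")
print(f"x2 entered first: x2 = {x2_first[0]:.3f}, x1 adds {x2_first[1]:.3f}")
```

The total R squared is the same in both orders; only the way it is divided between the predictors changes.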
Despite its versatility, multiple regression does make assumptions about the nature of the relationships between variables:

As described earlier, shrinkage (the drop in R when the regression equation is applied to a new sample) occurs because MR capitalizes on the particular characteristics of the sample. The beta weights are determined using the least-squares criterion to fit that sample as closely as possible, so they will likely not apply to a new sample very well. This has three basic causes:
Shrinkage can be handled in three basic ways:

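One widely used way to estimate shrinkage is the adjusted R squared, which lowers the sample R squared according to the number of subjects and predictors. Whether it is among the three ways intended above is an assumption; the sketch below and its input values are illustrative only.

```python
def adjusted_r_squared(r_squared, n, k):
    """Estimate of R squared after shrinkage.

    r_squared: sample squared multiple correlation
    n:         number of subjects
    k:         number of predictors
    """
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# Hypothetical values: the same sample R squared with different designs
large_sample = adjusted_r_squared(r_squared=0.30, n=53, k=2)
small_sample = adjusted_r_squared(r_squared=0.30, n=15, k=2)
many_predictors = adjusted_r_squared(r_squared=0.30, n=53, k=10)

print(f"N = 53, k = 2:  {large_sample:.3f}")
print(f"N = 15, k = 2:  {small_sample:.3f}")
print(f"N = 53, k = 10: {many_predictors:.3f}")
```

The estimate shrinks more when the sample is small or the number of predictors is large relative to the number of subjects.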