Part VII: Multiple Regression (MR)
Prediction with Continuous Variables

 
 
The raw score formula for the regression line in simple regression is Y' = bX + a, where Y' is the predicted score. The "weights" (b and a) for this line are selected on the basis of the Least Squares Criterion, where the sum of the squared residuals (the differences between the actual scores and the prediction line) is at a minimum and the sum of squares for the regression (the squared differences between the prediction line and the mean) is at a maximum.
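
As a minimal sketch of the Least Squares Criterion, the slope and intercept can be computed directly from deviation scores. The function names and the five data points are hypothetical, made up purely for illustration.

```python
def fit_line(x, y):
    """Return slope b and intercept a of the least-squares line Y' = bX + a."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: sum of cross-products of deviations over sum of squared X deviations.
    sp_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    b = sp_xy / ss_x
    # The least-squares line always passes through the point of means.
    a = mean_y - b * mean_x
    return b, a


def sum_sq_residuals(x, y, b, a):
    """Sum of squared differences between actual and predicted scores."""
    return sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))


# Hypothetical scores, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, a = fit_line(x, y)
ss_res = sum_sq_residuals(x, y, b, a)
# Any other line, e.g. one with a slightly steeper slope, leaves more error.
ss_res_other = sum_sq_residuals(x, y, b + 0.1, a)
print(b, a, ss_res, ss_res_other)
```

Changing either weight away from the fitted values can only increase the sum of squared residuals, which is what "least squares" means.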


Often, you may need to include more than one predictor in order to enhance the prediction of Y. However, in this case, the predictor variables are usually correlated with each other. The problems that this can cause in terms of accounting for variance can be diagrammed:

[Figure: diagram of the overlapping variances of X1, X2, and Y, with the regions numbered 1 to 4.] X1, X2, and Y represent the variables. The numbers reflect variance overlap as follows:
  1. Proportion of Y uniquely predicted by X2
  2. Proportion of Y redundantly predicted by X1 and X2
  3. Proportion of variance shared by X1 and X2
  4. Proportion of Y uniquely predicted by X1

Given the redundant information inherent in X1 and X2, how do we optimally combine X1 and X2 to predict Y?

 
 
The analysis of the various overlaps presents a problem in terms of correlations. For example, the correlation between x1 and y also accounts for variance that is predicted by x2. However, this problem can be corrected for mathematically. There are three types of correlations which are involved in prediction and regression:
  • Zero-Order Correlation: This is the relationship between two variables, ignoring the influence of other variables in prediction. In the diagrammed example above, the zero-order correlation between y and x2 reflects the variance represented by sections 1 and 2, while the variance of sections 3 and 4 remains part of the overall variances in x1 and y respectively. This is the cause of the redundancy problem, because a simple correlation does not account for possible overlaps between independent variables.
  • Partial Correlations: This is the relationship between two variables after removing the overlap with a third variable completely from both. For example, in the diagram above, this would be the relationship between y and x2 after removing the influence of x1 on both y and x2. In other words, the partial correlation reflects the variance represented by section 1, while the variance represented by sections 2, 3, and 4 is removed from the overall variances of the variables. Below is the formula for calculating a partial correlation:

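$$r_{y2 \cdot 1} = \frac{r_{y2} - r_{y1}\, r_{12}}{\sqrt{(1 - r_{y1}^2)(1 - r_{12}^2)}}$$

where r_{y1}, r_{y2}, and r_{12} are the zero-order correlations.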

  • Part (Semi-Partial) Correlations: This is the relationship between two variables after removing a third variable from just the independent variable. In the diagram above, this would be the relationship between y and x2 with the influence of x1 removed from x2 only. In other words, the part correlation removes the variance represented by sections 2 and 3 from x2, while sections 2 and 4 are not removed from y. The formula is as follows:

    $$r_{y(2 \cdot 1)} = \frac{r_{y2} - r_{y1}\, r_{12}}{\sqrt{1 - r_{12}^2}}$$

    Note that because variance is also removed from y in the partial correlation, its absolute value will always be at least as large as that of the part correlation. Also note that the squared part correlation is expressed as a proportion of the total variance in y, so it shows how much a predictor adds to prediction once its overlap with the other predictors has been taken into account. This makes it more suitable for prediction when redundancy exists. Therefore, the part correlation is the basis of multiple regression.
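
As a minimal sketch, the partial and part correlations between y and x2 (controlling for x1) can be computed from the three zero-order correlations using the standard two-predictor formulas. The function names and the correlation values below are hypothetical, chosen only for illustration.

```python
import math


def partial_corr(r_y2, r_y1, r_12):
    """Correlation between y and x2 with x1 removed from BOTH y and x2."""
    return (r_y2 - r_y1 * r_12) / math.sqrt((1 - r_y1 ** 2) * (1 - r_12 ** 2))


def part_corr(r_y2, r_y1, r_12):
    """Correlation between y and x2 with x1 removed from x2 ONLY."""
    return (r_y2 - r_y1 * r_12) / math.sqrt(1 - r_12 ** 2)


# Hypothetical zero-order correlations.
r_y1, r_y2, r_12 = 0.5, 0.6, 0.4

pr = partial_corr(r_y2, r_y1, r_12)
sr = part_corr(r_y2, r_y1, r_12)
print(f"zero-order = {r_y2:.3f}, partial = {pr:.3f}, part = {sr:.3f}")
```

The two formulas share a numerator; the partial correlation divides by an additional term that is never greater than 1, which is why its absolute value cannot be smaller than that of the part correlation.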

 
 
While bivariate regression uses a regression line as the basis of prediction for Y, multiple regression uses a plane in three-dimensional space (in the two-predictor case). Hence, the formula simply adds a term for each predictor, with each term having its own coefficient: Y' = b1X1 + b2X2 + a. Once again, the Least Squares Criterion is used to minimize the error of prediction. In this case, the "weights" are known as unstandardized regression coefficients. They can be expressed as standardized regression coefficients (beta weights), the weights that would be obtained if all variables were first converted to z-scores, by dividing the standard deviation of x by the standard deviation of y and multiplying this by the unstandardized coefficient.

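When this is done, the intercept drops out of the regression formula because it equals zero. The standardized weights behave much like part correlations: each reflects the relationship between y and a predictor from which the other predictors have been removed (in the two-predictor case, the beta weight and the part correlation share a numerator and differ only in the denominator). In this way, the regression formula accounts for the maximum amount of variance that can be predicted.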
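
The relationship between unstandardized and standardized weights can be sketched as follows for the two-predictor case. The helper names and the small data set are hypothetical; the weights are obtained by solving the two least-squares normal equations directly, which is only one of several ways to fit the plane.

```python
import math


def mean(v):
    return sum(v) / len(v)


def sd(v):
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))


def zscores(v):
    m, s = mean(v), sd(v)
    return [(vi - m) / s for vi in v]


def fit_two_predictors(x1, x2, y):
    """Least-squares weights for Y' = b1*X1 + b2*X2 + a."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    d1 = [v - m1 for v in x1]
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(u * u for u in d1)
    s22 = sum(u * u for u in d2)
    s12 = sum(u * v for u, v in zip(d1, d2))
    s1y = sum(u * v for u, v in zip(d1, dy))
    s2y = sum(u * v for u, v in zip(d2, dy))
    # det is zero only if X1 and X2 are perfectly correlated.
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return b1, b2, a


# Hypothetical scores, for illustration only.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
y = [3, 2, 6, 5, 9, 7]

b1, b2, a = fit_two_predictors(x1, x2, y)

# Standardize each weight: multiply b by (SD of its predictor / SD of y).
beta1 = b1 * sd(x1) / sd(y)
beta2 = b2 * sd(x2) / sd(y)

# Fitting the same model to z-scores gives the beta weights directly,
# and the intercept is zero (up to rounding error).
z1, z2, za = fit_two_predictors(zscores(x1), zscores(x2), zscores(y))
print(beta1, z1, beta2, z2, za)
```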

Overall, the multiple correlation coefficient (multiple R) can be interpreted like a Pearson's correlation coefficient: it is the correlation between Y and the predicted scores Y'. In other words, R Squared is the proportion of Y variance accounted for by the predictors. The formula for the multiple correlation in the two-predictor case is as follows:

$$R_{y \cdot 12} = \sqrt{\frac{r_{y1}^2 + r_{y2}^2 - 2\, r_{y1}\, r_{y2}\, r_{12}}{1 - r_{12}^2}}$$
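
As a small illustration, the two-predictor multiple R can be computed from the three zero-order correlations. The function name and the correlation values are hypothetical. The sketch also checks a standard identity: R Squared equals the squared zero-order correlation of one predictor plus the squared part correlation of the other.

```python
import math


def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of y with x1 and x2 (two-predictor case)."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


# Hypothetical zero-order correlations.
r_y1, r_y2, r_12 = 0.5, 0.6, 0.4

R = multiple_r(r_y1, r_y2, r_12)

# Part correlation of x2 (x1 removed from x2 only).
sr_2 = (r_y2 - r_y1 * r_12) / math.sqrt(1 - r_12 ** 2)
r_squared_check = r_y1 ** 2 + sr_2 ** 2

# With uncorrelated predictors, R squared is just the sum of the squared r's.
R_uncorrelated = multiple_r(r_y1, r_y2, 0.0)
print(R ** 2, r_squared_check, R_uncorrelated ** 2)
```

With these values R Squared is about .44, less than the .61 that would result if the two predictors were uncorrelated, because part of what each predicts is redundant with the other.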

 
 
The test of significance for R is as follows:

$$F = \frac{R^2 / k}{(1 - R^2)/(N - k - 1)}, \qquad df = (k,\; N - k - 1)$$

where N = number of subjects and k = number of predictors
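
A minimal sketch of this test, using the standard F ratio for R Squared, is given below. The function name and the input values are hypothetical; the resulting F would still need to be compared with a critical value (or converted to a p-value) for the stated degrees of freedom.

```python
def f_for_r_squared(r_squared, n, k):
    """F ratio and degrees of freedom for testing R squared against zero."""
    df_num = k
    df_den = n - k - 1
    f = (r_squared / df_num) / ((1 - r_squared) / df_den)
    return f, df_num, df_den


# Hypothetical result: R squared = .44 with N = 50 subjects and k = 2 predictors.
f, df1, df2 = f_for_r_squared(0.44, n=50, k=2)
print(f"F({df1}, {df2}) = {f:.2f}")
```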

What does significance of R mean in terms of prediction?

  • As a group, the set of predictors accounts for a significant amount of variance in y.
  • At least one independent variable has a regression weight that differs from zero in the population.
  • R is significantly different from the value specified in the null hypothesis (typically zero).
 
 
Once it is determined that the overall set of predictors is significant, it is usually of interest to know which variables account for the most variance in Y. There are four basic indices of the relative contribution of a variable:
  • Zero-order Correlations: These are simply the correlations between a particular predictor and Y. These correlations, however, are very inadequate representations of the variable's unique ability to predict Y. (Remember the earlier discussion about correlations?)
  • Standardized Beta Weights: Those variables which have the largest absolute values of weights are those that most strongly predict Y. However, since the weights are mathematically determined, they may not completely capture the true relationship between the variables. Also, shrinkage becomes a problem; the weights may be optimal for this sample, but will almost certainly lead to a smaller R Squared when applied to another sample.
  • Darlington's Usefulness Criterion: Usefulness is defined as the amount R Squared would drop if a variable were left out of the equation and R Squared were calculated with just the other variables. If R Squared drops considerably, then x is a useful predictor.
  • Incremental Validity of a Variable: Would the addition of a new predictor significantly enhance our predictive abilities? This can be determined by the following formula:

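$$F = \frac{(R^2_{\text{full}} - R^2_{\text{reduced}})/(k_{\text{full}} - k_{\text{reduced}})}{(1 - R^2_{\text{full}})/(N - k_{\text{full}} - 1)}$$

where the full equation contains the new predictor(s), the reduced equation does not, k is the number of predictors in each equation, and N is the number of subjects.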
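
Usefulness and incremental validity can be sketched together, since both rest on the difference between two R Squared values. The function names and the numbers are hypothetical; the F ratio is the standard test for a change in R Squared.

```python
def usefulness(r_sq_full, r_sq_reduced):
    """Drop in R squared when a variable is left out of the equation."""
    return r_sq_full - r_sq_reduced


def f_for_increment(r_sq_full, r_sq_reduced, n, k_full, k_reduced):
    """F ratio and degrees of freedom for the gain in R squared."""
    df_num = k_full - k_reduced
    df_den = n - k_full - 1
    f = ((r_sq_full - r_sq_reduced) / df_num) / ((1 - r_sq_full) / df_den)
    return f, df_num, df_den


# Hypothetical results: R squared = .25 with X1 alone,
# R squared = .44 with X1 and X2, N = 50 subjects.
drop = usefulness(0.44, 0.25)
f_change, df1, df2 = f_for_increment(0.44, 0.25, n=50, k_full=2, k_reduced=1)
print(f"usefulness of X2 = {drop:.2f}, F({df1}, {df2}) = {f_change:.2f}")
```

In this example the usefulness of X2 is .19, which is also its squared part correlation with Y.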

 
 
Remember how the predictors give redundant information in the prediction of Y? This raises an important methodological consideration when it comes to selecting which variables should be used in the MR equation. For example, a predictor entered late in the equation may contribute very little in terms of prediction because the previous predictors have already accounted for the variance it shares with Y. However, if that predictor had been entered first, it may have accounted for all that variance and the others may not have contributed anything above and beyond it.

There are two general categories of variable entry methods used with MR:

  • Simultaneous Entry: With this method, all variables are entered at the same time and the Beta weights are determined simultaneously. Each weight reflects only the unique contribution of its variable; variance shared among the predictors is not credited to any one of them. This is generally used when all predictors were intended to be used and there is no theoretical reason to consider a subset of predictors.
  • Sequential (Hierarchical) Entry: This is typically used to build a subset of predictors. There are two major ways of determining the order in which variables should be entered into or removed from the equation:

(1) A priori: Literally means determined beforehand. Variables are entered in the order determined by some theory.

(2) Statistical Criteria: The computer decides the order in which variables are entered based on their unique predictive abilities. There are three of these methods.

a. Forward Inclusion: For this strategy, predictor variables are selected for inclusion into the MR equation only if they meet certain statistical criteria. The order in which these variables are entered is entirely determined by these statistical criteria. The predictor which explains the greatest amount of Y variance is entered first (i.e., the one with the highest zero-order correlation); the variable that explains the greatest amount of Y variance not already accounted for is included next. This continues until the entry of any remaining variable does not significantly improve the prediction. It is possible that some variables are never entered.

b. Backward Exclusion: This method works in the opposite direction from the previous method. First, all the variables are entered into the equation. Then the variable that is the worst predictor of Y is removed, and this continues until removing another variable would produce a significant decrease in R Squared.

c. Stepwise Solution: Stepwise methods are identical to forward inclusion methods combined with the feature that a predictor variable, once included in the equation, may later be removed if it should lose its predictive power. This loss of power can occur because some of the variable's information becomes redundant with the newer variable.
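
The forward inclusion strategy can be sketched as follows, working from a correlation matrix. Everything here is illustrative: the function names, the correlations, the sample size, and in particular the fixed F-to-enter cutoff of 4.0, which is an assumption standing in for a proper significance test at each step. Statistical packages implement their own entry criteria.

```python
def solve(a, b):
    """Solve a linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(subset, r_xx, r_xy):
    """R squared of y on the predictors in `subset`, from correlations."""
    if not subset:
        return 0.0
    a = [[r_xx[i][j] for j in subset] for i in subset]
    b = [r_xy[i] for i in subset]
    betas = solve(a, b)  # standardized weights for this subset
    return sum(beta * r for beta, r in zip(betas, b))


def forward_inclusion(r_xx, r_xy, n, f_to_enter=4.0):
    """Enter predictors one at a time while the gain in R squared is large enough."""
    selected = []
    remaining = list(range(len(r_xy)))
    current = 0.0
    while remaining:
        # Candidate that explains the most Y variance not already accounted for.
        best = max(remaining, key=lambda j: r_squared(selected + [j], r_xx, r_xy))
        new = r_squared(selected + [best], r_xx, r_xy)
        df_den = n - (len(selected) + 1) - 1
        f = (new - current) / ((1 - new) / df_den)
        if f < f_to_enter:
            break  # no remaining variable improves prediction enough
        selected.append(best)
        remaining.remove(best)
        current = new
    return selected, current


# Hypothetical correlations among three predictors (X1, X2, X3) and with Y.
r_xx = [[1.0, 0.4, 0.3],
        [0.4, 1.0, 0.5],
        [0.3, 0.5, 1.0]]
r_xy = [0.5, 0.6, 0.3]

selected, final_r_squared = forward_inclusion(r_xx, r_xy, n=50)
print(selected, final_r_squared)
```

With these values X2 (index 1) enters first because it has the highest zero-order correlation, X1 (index 0) enters second, and X3 is never entered: although it correlates .3 with Y, that information is redundant with X2.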

 
 
Despite its versatility, multiple regression does make assumptions about the nature of the relationships between variables:
  • Linearity: Since it is based on linear correlations, multiple regression assumes linear bivariate relationships between each x and y, and also between y and y'. However, with special techniques, MR can be used to model nonlinear relationships, something that will be described in the next section.
  • Normality: Multiple regression assumes that the residuals (actual scores minus predicted scores) are normally distributed, both univariately and multivariately.
 
 
As described earlier, shrinkage occurs because MR capitalizes on chance characteristics of the sample; beta weights are determined using the least-squares criterion to fit this particular sample and will likely not apply to a new sample as well. This has three basic causes:
  • Low N:k Ratio: It is optimal in research to have a sufficient number of participants for each predictor. When the number of participants is low relative to the number of predictors (below 20:1), sample estimates may not predict the population.
  • Multicollinearity: While MR is designed to handle correlations between variables, high correlations between predictors can cause instability of prediction. If the intercorrelations between predictors become extremely high (around .8 or above), the standard errors of the beta weights become very large (and grow without bound as the intercorrelations approach 1), suggesting that it will be highly unlikely that the present findings can be applied to another sample (i.e., that our findings will replicate).
  • Measurement Error: If the measurement of a predictor does not reflect a true score, the application of Beta weights to a new sample may not be accurate.
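
The effect of multicollinearity on the stability of the weights can be illustrated with the variance inflation factor, a standard quantity that is not named in the text above. In the two-predictor case it equals 1 / (1 - r12 squared), and the standard error of each weight is multiplied by its square root. The function name and the chosen correlation values are hypothetical.

```python
def variance_inflation(r_12):
    """Factor by which the sampling variance of each weight is inflated
    when the two predictors correlate r_12 (two-predictor case)."""
    return 1.0 / (1.0 - r_12 ** 2)


for r_12 in (0.0, 0.5, 0.8, 0.9, 0.99):
    vif = variance_inflation(r_12)
    print(f"r12 = {r_12:.2f}  variance inflation = {vif:6.2f}  "
          f"SE multiplier = {vif ** 0.5:5.2f}")
```

The inflation is modest at moderate correlations and climbs steeply as the correlation between predictors approaches 1.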

Shrinkage can be handled in three basic ways:

  • Shrinkage Formulas: Formulas exist for estimating the amount of shrinkage that can occur in a particular sample.
  • Cross-Validation Studies: If the concern is how accurate Beta weights are when applied to a new sample, why not just get another sample and apply the original weights? This will give an indication of how much R will shrink.
  • A priori Weights: Shrinkage is not a problem when weights are determined beforehand.
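
As an example of a shrinkage formula, the sketch below uses the widely reported adjusted R Squared, 1 - (1 - R Squared)(N - 1)/(N - k - 1). This choice is an assumption, since the text does not say which formulas it has in mind, and the function name and values are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Estimate of R squared after shrinkage, given N subjects and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# The same sample R squared shrinks far more when the N:k ratio is low.
large_sample = adjusted_r_squared(0.44, n=50, k=2)
small_sample = adjusted_r_squared(0.44, n=10, k=2)
print(f"N = 50: {large_sample:.3f}   N = 10: {small_sample:.3f}")
```

With two predictors, a sample R Squared of .44 is estimated to shrink only slightly when N = 50 but drops to .28 when N = 10, which illustrates why a low N:k ratio is listed as a cause of shrinkage.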