Part IV: Correlation and Regression
Overview of Regression

Suppose we are trying to predict a continuous variable Y from X, and we obtain the scatterplot shown below. We wish to construct the "best fitting" prediction line, which captures the linear relationship between X and Y and allows us to predict Y from a given X score. We wish to form this line so that the errors of prediction are, taken together, as small as possible. An error of prediction is the difference between a subject's actual Y score and the predicted Y score obtained from X via the prediction line.

[Figure: scatterplots of Y against X]
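
To make the idea of a prediction error concrete, the sketch below computes the errors for two candidate lines on a small made-up data set. The data, the function names prediction_errors and sum_squared_errors, and the choice of the sum of squared errors as the overall measure (the usual least-squares criterion) are assumptions for illustration.

```python
# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]


def prediction_errors(xs, ys, slope, intercept):
    """Return actual Y minus predicted Y for each subject."""
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def sum_squared_errors(xs, ys, slope, intercept):
    """Total squared error of prediction for one candidate line."""
    return sum(e ** 2 for e in prediction_errors(xs, ys, slope, intercept))


# Two candidate prediction lines of the form Y' = slope * X + intercept.
sse_line_a = sum_squared_errors(xs, ys, slope=0.6, intercept=2.2)
sse_line_b = sum_squared_errors(xs, ys, slope=1.0, intercept=1.0)

# The line with the smaller total is the better fitting of the two.
print(sse_line_a, sse_line_b)
```

Under this criterion, line A fits these data better than line B because its squared errors sum to a smaller total.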

The regression line that minimizes the squared errors of prediction, summed over the whole sample, has the following formula, including a breakdown for the slope (b) and the intercept (a):

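$$Y' = bX + a, \qquad b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}, \qquad a = \bar{Y} - b\bar{X}$$

Here Y' is the predicted Y score, and \bar{X} and \bar{Y} are the sample means of X and Y.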
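
As a minimal sketch, the slope and intercept can be computed directly from deviation scores, assuming the standard least-squares formulas b = Σ(X − mean of X)(Y − mean of Y) / Σ(X − mean of X)² and a = (mean of Y) − b(mean of X). The function name fit_line and the data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares prediction line."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the means.
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of X from its mean.
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, a = fit_line(xs, ys)
print(f"Y' = {b:.2f} * X + {a:.2f}")  # prints: Y' = 0.60 * X + 2.20
```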

Note that the slope of the regression line is directly related to the correlation coefficient. Thus, we can obtain the predicted Y score for each X value via:

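$$b = r_{xy}\,\frac{s_Y}{s_X}$$

$$Y' = r_{xy}\,\frac{s_Y}{s_X}\,(X - \bar{X}) + \bar{Y}$$

Here r_{xy} is the correlation between X and Y, and s_X and s_Y are the standard deviations of X and Y.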
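
The relationship between the slope and the correlation can be checked numerically. The sketch below assumes the standard identity b = r(s_Y / s_X) and predicts Y as (mean of Y) + b(X − mean of X). The names pearson_r and predict, and the data, are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    x_bar, y_bar = mean(xs), mean(ys)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


def predict(x, xs, ys):
    """Predicted Y for a given X, with the slope written as r * s_Y / s_X."""
    # stdev() uses n - 1; the ratio s_Y / s_X is the same with n instead,
    # as long as both standard deviations use the same divisor.
    slope = pearson_r(xs, ys) * stdev(ys) / stdev(xs)
    return mean(ys) + slope * (x - mean(xs))


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_from_r = pearson_r(xs, ys) * stdev(ys) / stdev(xs)
print(round(slope_from_r, 4), round(predict(5, xs, ys), 4))
```

For these data the slope obtained from r matches the slope obtained from the deviation-score formula, which is the point of the identity.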

Furthermore, if X and Y are expressed as standard scores (i.e., we have converted the raw scores for X and Y to z scores), then, with a little algebra, the prediction equation reduces to:

$$z_{Y'} = r_{xy}\, z_X$$
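
A short numerical check of the standard-score case: after converting X and Y to z scores, the least-squares slope equals the correlation and the intercept is zero. This is a sketch on hypothetical data; the helper name z_scores is an assumption, and sample standard deviations (n − 1) are used throughout.

```python
from statistics import mean, stdev


def z_scores(values):
    """Convert raw scores to z scores using the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

zx, zy = z_scores(xs), z_scores(ys)

# With n - 1 standard deviations, r is the sum of z-score cross-products
# divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

# Least-squares slope and intercept for predicting z_Y from z_X.
slope = sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)
intercept = mean(zy) - slope * mean(zx)

# Predicted z scores on Y are simply r times the z scores on X.
predicted_zy = [r * z for z in zx]
print(round(r, 4), round(slope, 4), round(intercept, 4))
```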

The correlation coefficient is an index of the magnitude, or strength, of the linear relationship between X and Y. The coefficient ranges from -1.00 to +1.00: a correlation of +1.00 means a perfect positive linear relationship between X and Y, and a correlation of -1.00 means a perfect negative linear relationship. Note: When the absolute value of the correlation is 1.00, all Y values fall on the regression line.
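
The two extremes can be illustrated with a small sketch. The data sets are constructed (not observed) so that every Y value lies exactly on a line, and the function name pearson_r is hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    x_bar, y_bar = mean(xs), mean(ys)
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


xs = [1, 2, 3, 4, 5]

# Every Y lies exactly on a rising line: perfect positive relationship.
ys_rising = [2 * x + 1 for x in xs]
# Every Y lies exactly on a falling line: perfect negative relationship.
ys_falling = [3 - 2 * x for x in xs]
# Hypothetical scores that only roughly follow a line.
ys_scattered = [2, 4, 5, 4, 5]

r_rising = pearson_r(xs, ys_rising)
r_falling = pearson_r(xs, ys_falling)
r_scattered = pearson_r(xs, ys_scattered)
print(r_rising, r_falling, round(r_scattered, 4))
```

Only the first two data sets reach an absolute correlation of 1.00; the scattered scores give a value strictly between 0 and 1.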

If we square r_xy, we get the proportion of the variance in Y that is explained, or accounted for, by X. Another way to express this is:

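$$r_{xy}^2 = \frac{\sum (Y' - \bar{Y})^2}{\sum (Y - \bar{Y})^2} = 1 - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2}$$

Here Y' is the predicted Y score from the regression line and \bar{Y} is the mean of Y.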
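
As a worked check, the sketch below compares the squared correlation with the ratio of explained to total variation in Y, and with one minus the ratio of error to total variation. It assumes the least-squares line for the predictions; the data and variable names are hypothetical.

```python
from statistics import mean

# Hypothetical data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sum_xx = sum((x - x_bar) ** 2 for x in xs)
sum_yy = sum((y - y_bar) ** 2 for y in ys)

# Least-squares prediction line and the predicted Y scores.
slope = sum_xy / sum_xx
intercept = y_bar - slope * x_bar
predicted = [slope * x + intercept for x in xs]

# Partition of the variation in Y.
ss_total = sum_yy
ss_explained = sum((p - y_bar) ** 2 for p in predicted)
ss_error = sum((y - p) ** 2 for y, p in zip(ys, predicted))

# Squared correlation, written without taking a square root.
r_squared = sum_xy ** 2 / (sum_xx * sum_yy)

print(round(r_squared, 4))
print(round(ss_explained / ss_total, 4))
print(round(1 - ss_error / ss_total, 4))
```

All three printed values agree for these data, so about 60% of the variation in Y is accounted for by X in this made-up example.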