Part IV: Correlation and Regression
Other Regression Topics

 
 
Suppose we have several separate samples, each with its own correlation value. If we want to combine these data in some way to yield a single estimate of the population correlation, we could do one of the following two things:
  • Pool all of the samples into one group and calculate the correlation pooled across groups.
  • Take an average of the correlations from the separate samples.

Approach 1 is generally inappropriate because the regression line for the pooled sample may be quite different from the regression lines within each separate group. For example, suppose we have 3 groups where the correlation is zero within each group, as diagrammed:

Now, suppose we collapse the groups into one large sample and recalculate the correlation. Our value for the correlation would now be large and positive, which inaccurately represents the X and Y relationship within each sample.
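The pooling problem can be checked with a small simulation. The sketch below is only an illustration under assumed values: the group centers (0, 5, 10), the group size of 200, and the helper names `pearson_r` and `make_group` are all hypothetical and are not taken from the diagram.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def make_group(center, n, rng):
    """Hypothetical group: X and Y are independent, both centered on `center`."""
    xs = [center + rng.gauss(0, 1) for _ in range(n)]
    ys = [center + rng.gauss(0, 1) for _ in range(n)]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is repeatable

# Three groups with no X-Y relationship inside any group (assumed centers).
groups = [make_group(center, 200, rng) for center in (0.0, 5.0, 10.0)]
within = [pearson_r(xs, ys) for xs, ys in groups]

# Collapse the groups into one large sample and recompute the correlation.
pooled_x = [x for xs, _ in groups for x in xs]
pooled_y = [y for _, ys in groups for y in ys]
pooled = pearson_r(pooled_x, pooled_y)

print("within-group r:", [round(r, 2) for r in within])
print("pooled r:", round(pooled, 2))
```

The within-group correlations hover near zero, while the pooled correlation is large and positive, because the differences between the group means are the only thing driving it.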

Rather, we should take an average of the correlations from the individual groups. Here we can apply Fisher's r-to-z transformation to each correlation, and then weight the transformed correlations appropriately using:

[Equation image: weighted average of the z-transformed correlations]
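Because the weighting formula appears here only as an image, the sketch below assumes one common form of it: each transformed correlation is weighted by n - 3 (the inverse of the sampling variance of Fisher's z), and the weighted mean is then converted back to a correlation. The function name `average_correlation` and the sample values are hypothetical.

```python
import math


def average_correlation(rs, ns):
    """Weighted average of correlations via Fisher's r-to-z transformation.

    Assumed weighting: each z is weighted by (n - 3), the inverse of the
    sampling variance of z. This is a common choice, not necessarily the
    exact formula shown in the original notes.
    """
    if len(rs) != len(ns):
        raise ValueError("rs and ns must have the same length")
    if any(n <= 3 for n in ns):
        raise ValueError("every sample size must be greater than 3")

    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]  # r-to-z: z = 0.5 * ln((1 + r) / (1 - r))
    z_bar = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(z_bar)  # z-to-r: back to correlation units


# Hypothetical example: three samples of different sizes.
r_bar = average_correlation([0.30, 0.45, 0.60], [25, 40, 100])
print(round(r_bar, 3))
```

With this weighting, larger samples pull the average toward their own correlation, so the result here lies closer to 0.60 than a simple unweighted mean would.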

 
 
We can use Fisher's r-to-z transformation to calculate the confidence interval around the correlation coefficient. First, we would transform the correlation using either the formula or the table discussed earlier. Next, the confidence interval around the transformed value is calculated as:

[Equation image: confidence interval for the z-transformed correlation]

Finally, we would transform the two endpoints of the interval back from z to r (the reverse of the r-to-z transformation) in order to obtain our interval in "correlation units."
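The three steps can be sketched as follows. The interval formula is assumed to be the usual one, z ± z_crit / sqrt(n - 3), since the original formula is shown only as an image. The function name `correlation_confidence_interval` and the example values (r = .50, n = 103) are hypothetical.

```python
import math
from statistics import NormalDist


def correlation_confidence_interval(r, n, confidence=0.95):
    """Confidence interval for a correlation via Fisher's transformation.

    Assumes the standard error of z is 1 / sqrt(n - 3).
    """
    if n <= 3:
        raise ValueError("n must be greater than 3")

    z = math.atanh(r)  # step 1: r-to-z
    standard_error = 1.0 / math.sqrt(n - 3)
    # Two-tailed critical value from the standard normal distribution.
    z_crit = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)

    z_lower = z - z_crit * standard_error  # step 2: interval in z units
    z_upper = z + z_crit * standard_error

    # Step 3: convert both endpoints back to correlation units.
    return math.tanh(z_lower), math.tanh(z_upper)


lower, upper = correlation_confidence_interval(0.50, 103)
print(round(lower, 3), round(upper, 3))
```

The interval is symmetric in z units but not in correlation units: after the back-transformation, the lower limit sits farther from r = .50 than the upper limit does.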

 
 
The following factors directly affect the size of the correlation coefficient.
  • Extent of nonlinearity in the X-Y relationship: A correlation reflects only the linear relationship between X and Y. For example, if we fit a regression line to a scatterplot in which the points follow a curve, the correlation coefficient would be low despite the fact that there is a strong curvilinear relationship. This is demonstrated in the drawing at the right, where the regression line does not accurately describe the nature of the relationship.
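  • Nature of the distributions: In order for the coefficient to reach 1.00, the separate distributions (marginal distributions) of X and Y must have the same shape; that is, they must be identical once each variable is standardized. If the shapes differ, for example when one variable is skewed and the other is symmetric, the largest attainable correlation is less than 1.00.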
  • Amount of range restriction: If we calculate the correlation coefficient for only a small segment of some population, we will likely reduce the coefficient for that sample because we have restricted the range of scores, and therefore the variance, on X and Y. Consider the drawing at the right, where the small sample in no way represents the relationship in the entire population. Restricted range may also have an unpredictable effect on the regression line.
  • Reliability of X and Y: Measurement error in X and Y will reduce the size of the correlation coefficient. The larger the measurement error, the lower the correlation coefficient we can expect to obtain.
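Three of these factors (nonlinearity, range restriction, and measurement error) can be illustrated with simulated data. The sketch below is only a demonstration under assumed settings: the sample size, the curve y = x², the restricted band 0 < X < 0.5, and the error variances are all hypothetical choices, as is the helper `pearson_r`.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 5000

# Nonlinearity: Y depends strongly on X, but through a symmetric curve.
x_curve = [rng.uniform(-3, 3) for _ in range(n)]
y_curve = [x ** 2 + rng.gauss(0, 0.5) for x in x_curve]
r_curve = pearson_r(x_curve, y_curve)

# Baseline linear relationship, used for the next two comparisons.
x_full = [rng.gauss(0, 1) for _ in range(n)]
y_full = [x + rng.gauss(0, 1) for x in x_full]
r_full = pearson_r(x_full, y_full)

# Range restriction: keep only cases from a narrow band of X.
kept = [(x, y) for x, y in zip(x_full, y_full) if 0.0 < x < 0.5]
r_restricted = pearson_r([x for x, _ in kept], [y for _, y in kept])

# Measurement error: add independent random error to both X and Y.
x_noisy = [x + rng.gauss(0, 1) for x in x_full]
y_noisy = [y + rng.gauss(0, 1) for y in y_full]
r_noisy = pearson_r(x_noisy, y_noisy)

print("curvilinear relationship:", round(r_curve, 2))
print("full range, no added error:", round(r_full, 2))
print("restricted range:", round(r_restricted, 2))
print("with measurement error:", round(r_noisy, 2))
```

In this setup the curvilinear data give a correlation near zero even though Y is almost completely determined by X, and both the restricted and the error-laden versions of the linear data give noticeably smaller correlations than the full, error-free data.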