Part VII: Multiple Regression (MR)
Using Categorical Variables in MR

 
 
Dummy coding is used when categorical variables (e.g. sex, geographic location, ethnicity) are of interest in prediction. Dummy codes are a series of numbers assigned to indicate group membership in any mutually exclusive and exhaustive category. The full range of uses of dummy coding schemes in MR includes:
  • The entry of group membership into an MR equation as a predictor variable.
  • Representing any ANOVA design and model interactions with an MR equation.
  • The ability to cope with missing data problems (particularly with ANOVA designs).
  • Conducting an ANCOVA analysis with MR (ANCOVA will be discussed later).
  • Representing the RMD problem for the MANOVA procedure (discussed much later).
 
 
There are five general rules/guidelines for the construction of dummy codes:
  • Number of Dummy Variables: The number of dummy-coded variables necessary to represent a single categorical variable is equal to the number of degrees of freedom available for the categories. Degrees of freedom are determined by the number of groups minus one. For example, sex can be represented by a single dummy code because there are only two alternatives. This becomes more complicated in a situation such as marital status, where there are the possibilities of single, married, divorced, separated, or widowed (df = 4). This feature allows for independent analysis of each group; however, each dummy code is uninterpretable by itself and must be considered in conjunction with all other variables designed to represent that set of categories.
  • Coding Values: In most applications of dummy codes, the individual codes will consist of small integers, most often 0's, 1's, and -1's. Except for dummy codes proper, each string (the codes on one variable taken across all groups) will typically add up to 0.
  • Redundancy: For a given category system, none of the dummy codes constructed (number of groups minus one) can be completely redundant. That is, one dummy code cannot be a simple constant multiple of another.
  • Orthogonality: It is not necessary to create orthogonal (independent) weights for a set of dummy codes. While this means that dummy codes will be correlated, their redundancy will be corrected by multiple regression.
  • Interaction: Interaction effects of two categorical variables (e.g. a joint effect of A and B on Y in a factorial design) are represented by dummy codes that are simply the products of the dummy codes separately constructed for A and B.
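As a minimal sketch of the first rule, the marital status example can be coded as follows. The helper name `dummy_codes` and the choice of "widowed" as the all-zero reference group are assumptions made for illustration only.

```python
# Hypothetical helper: build k - 1 dummy-coded variables for k categories.
def dummy_codes(categories, reference):
    """Map each category to a tuple of 0/1 codes; the reference group is all zeros."""
    coded = [c for c in categories if c != reference]
    return {c: tuple(1 if c == d else 0 for d in coded) for c in categories}

marital = ["single", "married", "divorced", "separated", "widowed"]
codes = dummy_codes(marital, reference="widowed")

for status in marital:
    print(f"{status:<10} {codes[status]}")

# Five groups need 5 - 1 = 4 dummy variables (df = 4).
n_vars = len(codes["single"])
print("number of dummy variables:", n_vars)
```

Any group could serve as the reference; the choice changes what each coefficient is compared against, not the number of variables needed.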
 
 
Keeping these general rules/guidelines for dummy variables in mind, a number of different coding schemes can be utilized depending on what type of relationship between the groups is of interest. In general, four types of codes are of use: dummy, effect, contrast, and trend. Remember that each participant will receive a score on each dummy variable, with the score depending on the group to which the individual belongs. For the examples below, the number of groups is equal to four; therefore, groups are denoted by A1, A2, A3, and A4, while the codes are labeled a1, a2, and a3.

Group   Dummy Codes    Effect Codes   Contrast Codes   Trend Codes
        a1  a2  a3     a1  a2  a3     a1  a2  a3       a1  a2  a3
A1       1   0   0      1   0   0      3   0   0       -3   1  -1
A2       0   1   0      0   1   0     -1   2   0       -1  -1   3
A3       0   0   1      0   0   1     -1  -1   1        1  -1  -3
A4       0   0   0     -1  -1  -1     -1  -1  -1        3   1   1

Thus, each individual in group A1 would receive the same coding as all other individuals in that group. Note the similarity between the individual contrast and trend codes and the general types of planned comparisons used in simpler statistical designs.

The same R Squared will be obtained regardless of which of the above schemes is used. The dummy and effect codes are easy to generate; they do not, however, produce orthogonal dummy variables, because orthogonality requires both the sums of the weights and the sums of their cross products to be zero. The contrast and trend codes, on the other hand, do (for equal group sizes) result in uncorrelated dummy variables.
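The orthogonality condition can be checked directly on the codes listed above. The sketch below assumes equal group sizes, so each group contributes one row; the names `SCHEMES` and `is_orthogonal` are illustrative.

```python
from itertools import combinations

# Codes for groups A1..A4 (rows) on variables a1..a3 (columns).
SCHEMES = {
    "dummy":    [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)],
    "effect":   [(1, 0, 0), (0, 1, 0), (0, 0, 1), (-1, -1, -1)],
    "contrast": [(3, 0, 0), (-1, 2, 0), (-1, -1, 1), (-1, -1, -1)],
    "trend":    [(-3, 1, -1), (-1, -1, 3), (1, -1, -3), (3, 1, 1)],
}

def is_orthogonal(rows):
    """True when every column sums to zero and every pair of columns has a zero cross product."""
    cols = list(zip(*rows))
    sums_zero = all(sum(c) == 0 for c in cols)
    cross_zero = all(
        sum(x * y for x, y in zip(u, v)) == 0 for u, v in combinations(cols, 2)
    )
    return sums_zero and cross_zero

results = {name: is_orthogonal(rows) for name, rows in SCHEMES.items()}
for name, ok in results.items():
    print(f"{name:<9} orthogonal: {ok}")
```

The dummy codes fail because their columns do not sum to zero, and the effect codes fail because their cross products are nonzero.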

 
 
Remember that the use of nominal scale data in prediction requires the use of dummy codes; this is because the data must be represented quantitatively for predictive purposes, and nominal data lack this quality. Once the data are coded properly (for this, use a text or a more established source), the analysis can be interpreted in a manner similar to traditional ANOVA. In fact, the sum of squares for the between group effect (e.g. religion) and the sum of squares for error are related to the proportion of variance accounted for:

Picture (305x65, 1.8Kb)
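The equation image is not reproduced in this text. A standard form consistent with the sentence above, writing SS_total for the total sum of squares of Y (a symbol assumed here, not taken from the original image), would be:

```latex
SS_{between} = R^2 \, SS_{total}, \qquad SS_{error} = (1 - R^2) \, SS_{total}
```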

Thus, the F Ratio can be calculated and considered as before. This formula is similar to the formula for the significance of R Squared, except that k represents the number of coded vectors (number of groups minus one):

Picture (279x60, 1.6Kb)
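The equation image is not reproduced in this text. The usual F Ratio for R Squared, which matches the description above, is sketched below; N (the total number of subjects) is a symbol assumed here.

```latex
F = \frac{R^2 / k}{(1 - R^2) / (N - k - 1)}, \qquad df = (k, \; N - k - 1)
```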

 
 
When MR is used in this instance, it is called a least squares ANOVA. When the number of subjects per cell is equal, the results given by MR and traditional ANOVA are identical. In the factorial design, each variable must be coded in accordance with the requirements set above. Thus, each variable requires a number of dummy codes.
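For the single-factor case, the equivalence between MR and traditional ANOVA can be sketched as below. The scores are made up for illustration, the helpers `solve` and `r_squared` are hypothetical names, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = next(r for r in range(i, n) if m[r][i] != 0)
        m[i], m[pivot] = m[pivot], m[i]
        p = m[i][i]
        m[i] = [v / p for v in m[i]]
        for r in range(n):
            if r != i:
                f = m[r][i]
                m[r] = [vr - f * vi for vr, vi in zip(m[r], m[i])]
    return [row[n] for row in m]

def r_squared(codes, y):
    """R squared from regressing y on an intercept plus the coded vectors."""
    X = [[Fraction(1)] + [Fraction(c) for c in row] for row in codes]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    mean = Fraction(sum(y), len(y))
    ss_total = sum((yi - mean) ** 2 for yi in y)
    ss_error = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return 1 - ss_error / ss_total

# Made-up scores for four groups of three subjects each (illustrative only).
scores = {"A1": [2, 3, 4], "A2": [4, 5, 6], "A3": [6, 7, 8], "A4": [5, 6, 7]}
dummy = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1), "A4": (0, 0, 0)}
effect = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1), "A4": (-1, -1, -1)}

groups = [g for g, ys in scores.items() for _ in ys]
y = [v for ys in scores.values() for v in ys]

r2_dummy = r_squared([dummy[g] for g in groups], y)
r2_effect = r_squared([effect[g] for g in groups], y)

# F from R squared, with k coded vectors and n subjects.
k, n = 3, len(y)
f_mr = (r2_dummy / k) / ((1 - r2_dummy) / (n - k - 1))

# Traditional one-way ANOVA on the same scores.
grand = Fraction(sum(y), n)
ss_between = sum(len(ys) * (Fraction(sum(ys), len(ys)) - grand) ** 2 for ys in scores.values())
ss_within = sum((v - Fraction(sum(ys), len(ys))) ** 2 for ys in scores.values() for v in ys)
f_anova = (ss_between / k) / (ss_within / (n - k - 1))

print(r2_dummy == r2_effect, f_mr == f_anova, float(f_mr))
```

Both coding schemes give the same R Squared, and the F Ratio computed from R Squared matches the one computed from the ANOVA sums of squares.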

In addition, interactions now need to be considered. Since the interaction has only one degree of freedom (in the 2 x 2 ANOVA case), it can be represented with just one dummy code. The single dummy code is obtained by simply multiplying the effect codes together for each subject. In other words, the code for the interaction is the product of the appropriate codes for group membership.
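A minimal sketch of the product rule for the 2 x 2 case follows. The level labels (`a1`, `a2`, `b1`, `b2`) and the assignment of +1 to the first level of each factor are assumptions for illustration.

```python
# Effect codes for a 2 x 2 design: +1 for the first level, -1 for the second.
A_CODE = {"a1": 1, "a2": -1}
B_CODE = {"b1": 1, "b2": -1}

def code_subject(a_level, b_level):
    """Return (A code, B code, interaction code) for one subject."""
    a, b = A_CODE[a_level], B_CODE[b_level]
    return a, b, a * b  # the interaction code is the product of the two effect codes

cells = [(a, b) for a in A_CODE for b in B_CODE]
table = {cell: code_subject(*cell) for cell in cells}

for cell, codes in table.items():
    print(cell, codes)
```

Every subject in the same cell receives the same three codes, so the interaction vector needs no information beyond group membership.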

Once again, the proportion of variance accounted for is directly related to the observed effects:

Picture (349x145, 2.6Kb)
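The equation image is not reproduced in this text. One plausible reading, assuming equal cell sizes (so the coded vectors for A, B, and AB are uncorrelated) and using subscript notation that is not taken from the original image, is:

```latex
\begin{aligned}
SS_A &= R^2_{Y.A} \, SS_{total} \\
SS_B &= R^2_{Y.B} \, SS_{total} \\
SS_{AB} &= R^2_{Y.AB} \, SS_{total} \\
SS_{error} &= \left(1 - R^2_{Y.A,B,AB}\right) SS_{total}
\end{aligned}
```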

 
 
When cell size is unequal in traditional ANOVA, orthogonality does not exist between effects; in other words, an analysis of the effects yields redundant information. This happens because the main effects and the interactions all become correlated--a problem well-suited to multiple regression. As already noted, the technique of dummy coding allows the use of categorical variables, and the analysis of factorial designs is relatively easy.
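The loss of orthogonality can be seen by counting subjects per cell in a 2 x 2 design and computing the cross products of the effect-coded vectors about their means. The cell sizes below are made up, and the function name is illustrative.

```python
from fractions import Fraction
from itertools import combinations

def centered_cross_products(cell_sizes):
    """Sums of cross products about the means for the A, B, and A x B effect-coded vectors."""
    # One row of (A code, B code, interaction code) per subject.
    rows = [(a, b, a * b) for (a, b), n in cell_sizes.items() for _ in range(n)]
    total = len(rows)
    cols = dict(zip(("A", "B", "AB"), zip(*rows)))
    out = {}
    for (nx, x), (ny, y) in combinations(cols.items(), 2):
        raw = sum(xi * yi for xi, yi in zip(x, y))
        out[f"{nx},{ny}"] = raw - Fraction(sum(x) * sum(y), total)
    return out

equal = centered_cross_products({(1, 1): 5, (1, -1): 5, (-1, 1): 5, (-1, -1): 5})
unequal = centered_cross_products({(1, 1): 6, (1, -1): 2, (-1, 1): 3, (-1, -1): 4})

print("equal cells:  ", equal)
print("unequal cells:", unequal)
```

With equal cells every cross product is zero; with unequal cells some are not, which is the redundancy the correction methods address.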

However, within MR there are various methods available for the elimination or correction of redundancy between main and interaction effects. This is basically the problem of selecting a variable entry strategy to estimate an "unconfounded" effect, which removes variance that can be accounted for by other effects. In all cases, the error term remains the same as before. There are three such methods:

  • Simultaneous (Complete) Correction: As with the simultaneous variable entry strategy, all effects are entered as a block and all terms are corrected for redundancy with all other terms. This is the most conservative and most common technique.
  • Classic Experimental Approach: This is a less conservative approach to correction. All main effects are corrected for all other main effects, while all interactions are corrected for other interactions. Thus, the main effect for A removes variance accounted for by B, and estimating B corrects for redundancy with A.
  • Hierarchical Approach: The hierarchical approach assumes a priori notions about variable importance among the main effects and a hierarchy of importance of correction is established. The most important independent variable is estimated first without correcting for redundancy with any other effects, and the second most important variable is corrected for correlation with only the first. The third entered is corrected for the first two. This then continues until all are entered. Therefore, the first variable entered is most likely to achieve significance.
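The three strategies can be compared by computing the proportion of Y variance credited to factor A under each one. This is a sketch under stated assumptions: a 2 x 2 design with effect codes, made-up scores, A treated as the most important variable in the hierarchical case, and hypothetical helper names throughout.

```python
from fractions import Fraction

def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination (exact with Fractions)."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(n):
        pivot = next(r for r in range(i, n) if m[r][i] != 0)
        m[i], m[pivot] = m[pivot], m[i]
        p = m[i][i]
        m[i] = [v / p for v in m[i]]
        for r in range(n):
            if r != i:
                f = m[r][i]
                m[r] = [vr - f * vi for vr, vi in zip(m[r], m[i])]
    return [row[n] for row in m]

def r_squared(codes, y):
    """R squared from regressing y on an intercept plus the coded vectors."""
    X = [[Fraction(1)] + [Fraction(c) for c in row] for row in codes]
    p = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    b = solve(xtx, xty)
    mean = Fraction(sum(y), len(y))
    ss_total = sum((yi - mean) ** 2 for yi in y)
    ss_error = sum((yi - sum(bj * xj for bj, xj in zip(b, r))) ** 2 for r, yi in zip(X, y))
    return 1 - ss_error / ss_total

TERMS = {"A": lambda a, b: a, "B": lambda a, b: b, "AB": lambda a, b: a * b}

def r2(data, *terms):
    """R squared for the model containing only the named effect-coded terms."""
    codes = [[TERMS[t](a, b) for t in terms] for a, b, _ in data]
    return r_squared(codes, [y for _, _, y in data])

def a_effect(data):
    """Proportion of Y variance credited to A under each correction strategy."""
    return {
        "simultaneous": r2(data, "A", "B", "AB") - r2(data, "B", "AB"),  # A corrected for B and AB
        "classic": r2(data, "A", "B") - r2(data, "B"),                   # A corrected for B only
        "hierarchical": r2(data, "A"),                                   # A entered first, uncorrected
    }

def expand(cells):
    """Turn {(a, b): [scores]} into one (a, b, y) row per subject."""
    return [(a, b, y) for (a, b), ys in cells.items() for y in ys]

# Made-up scores (illustrative only).
balanced = expand({(1, 1): [10, 12], (1, -1): [7, 9], (-1, 1): [6, 8], (-1, -1): [2, 4]})
unbalanced = expand({(1, 1): [10, 12, 11, 11], (1, -1): [7, 9],
                     (-1, 1): [6, 8], (-1, -1): [2, 4, 3, 3]})

for label, data in (("balanced", balanced), ("unbalanced", unbalanced)):
    print(label, {k: round(float(v), 4) for k, v in a_effect(data).items()})
```

With equal cell sizes the three strategies agree, so the choice only matters when the design is unbalanced.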