Working with categorical variables

Almost anyone who works with data is comfortable with binary outcomes (for example, case vs. control) and with relating them to continuous variables such as gene expression profiles. Indeed, the Student t-test and simple linear regression are among the first topics encountered in data analysis. Categorical outcomes that encode more than two possible groups or values can be more of a challenge: although there are statistical tests such as one-way ANOVA or the Kruskal-Wallis test that test whether a continuous variable has the same mean (ANOVA) or the same distribution (Kruskal-Wallis) across multiple groups, the analyst will usually want to know for which pairs of groups the values differ. For example, if the outcome can be control or one of 3 diseases, one would like to know not only that the expression of a gene differs among some of the 4 outcomes, but also which ones.

To make the following more concrete, imagine the outcome (call it O) is a categorical variable with M unique values. Each unique value is also called a “level”. We’ll label the levels L1, L2, …, LM. The levels could, for example, be “Control” and 3 different types of disease (M=4). One might like to know which genes (or WGCNA modules) differ in their expression between the various diseases and controls, between pairs of diseases, or which expression changes are unique to each disease.

The easiest approach is to create binary indicators for the various comparisons (I will call them contrasts). There are two types of such indicators. The first, call it a pairwise indicator, represents the contrast of two levels. A pairwise indicator for levels, say, L1 and L2 would be called L2 vs. L1; it equals 0 for all samples with outcome L1, 1 for all samples with outcome L2, and NA (missing value) otherwise. The values 0 and 1 are somewhat arbitrary (one could just as well use, say, 1 and 2), but it is advantageous to choose them so that their difference is 1. Note the naming convention: the level coded with the larger value is mentioned first in the name. This ensures that when the coefficient (usually interpreted as log-fold change) in a regression model of a variable (gene expression or module eigengene) on the indicator L2 vs. L1 is positive, the variable is higher in L2 samples than in L1 samples.
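To illustrate the definition above without relying on any package, here is a minimal base R sketch. The helper pairwiseIndicator, the level names and the expression values are all made up for illustration and are not part of WGCNA.

```r
# Hypothetical outcome with M = 4 levels (names are for illustration only)
outcome = rep(c("Control", "DiseaseA", "DiseaseB", "DiseaseC"), each = 2);

# Illustrative helper (not a WGCNA function): indicator "level2 vs. level1"
pairwiseIndicator = function(x, level1, level2)
{
  out = rep(NA_real_, length(x));   # NA for samples in neither level
  out[which(x == level1)] = 0;      # reference level
  out[which(x == level2)] = 1;      # level named first in "level2 vs. level1"
  out;
}

DiseaseA.vs.Control = pairwiseIndicator(outcome, "Control", "DiseaseA");
data.frame(outcome, DiseaseA.vs.Control);

# Made-up expression values for one gene, one value per sample
geneExpr = c(5.0, 5.2, 6.1, 5.9, 7.0, 7.1, 4.0, 4.2);

# lm drops the NA samples by default; because the indicator values differ
# by 1, the slope equals mean(DiseaseA) - mean(Control)
fit = lm(geneExpr ~ DiseaseA.vs.Control);
slope = unname(coef(fit)[2]);
slope;
```

Since the two indicator values differ by exactly 1, the fitted slope is simply the difference of the two group means, and its sign says directly which group is higher.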

The second type of binary indicators contrasts a level vs. all others. The indicator for level L1 could be called L1 vs. all others and equals 1 for all samples with outcome L1 and 0 otherwise. With M levels one can create M indicators.
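The level vs. all others indicators can likewise be sketched in a few lines of base R. This is only an illustration with made-up level names; the column naming mimics the convention used in this post but is my own construction here.

```r
# Hypothetical outcome with M = 4 levels (names are for illustration only)
outcome = rep(c("Control", "DiseaseA", "DiseaseB", "DiseaseC"), each = 2);
lvls = sort(unique(outcome));
M = length(lvls);

# One column per level: 1 for samples in that level, 0 for all others
vsAll = sapply(lvls, function(l) as.numeric(outcome == l));
colnames(vsAll) = paste0(lvls, ".vs.all");
data.frame(outcome, vsAll);

# Number of level vs. all others indicators (M) and, for comparison,
# the number of distinct pairwise indicators, M*(M-1)/2
ncol(vsAll);
choose(M, 2);
```

Each sample has a 1 in exactly one column, so the columns always sum to 1 across a row. This is the linear dependence on the intercept that matters when such indicators are used together in a regression.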

WGCNA implementation

As of September 2018, the WGCNA R package can binarize a categorical variable using the function binarizeCategoricalVariable. The function takes a whole lot of arguments that let the user specify precisely which indicators should be built and how they should be named. The R code below illustrates its use. (Copy and paste the code into an R session to try it out!)

# Load the WGCNA package, which provides binarizeCategoricalVariable
library(WGCNA);
# Define a categorical variable with 3 levels
x = rep(c("A", "B", "C"), each = 3);
# Binarize it into pairwise indicators
out = binarizeCategoricalVariable(x,
         includePairwise = TRUE,
         includeLevelVsAll = FALSE);
# Print the variable and the indicators
data.frame(x, out);

The code above prints a data frame consisting of the original variable x and the 3 pairwise binary indicators based on it:

  x B.vs.A C.vs.A C.vs.B
1 A      0      0     NA
2 A      0      0     NA
3 A      0      0     NA
4 B      1     NA      0
5 B      1     NA      0
6 B      1     NA      0
7 C     NA      1      1
8 C     NA      1      1
9 C     NA      1      1

The next code chunk illustrates creation of indicators for each level vs. all others:

# Binarize x into level vs. all others indicators
out = binarizeCategoricalVariable(x,
         includePairwise = FALSE,
         includeLevelVsAll = TRUE);
# Print the variable and the indicators
data.frame(x, out);

And here’s what R outputs: again a data frame, this time with the variable x and the 3 level vs. all others indicators.

  x A.vs.all B.vs.all C.vs.all
1 A        1        0        0
2 A        1        0        0
3 A        1        0        0
4 B        0        1        0
5 B        0        1        0
6 B        0        1        0
7 C        0        0        1
8 C        0        0        1
9 C        0        0        1

In my usual workflows, sample characteristics are typically contained in data frames and I often need to binarize several of them at once. For this I use the function binarizeCategoricalColumns, which applies the binarization to selected columns of a data frame. Finally, there are convenience wrappers named binarizeCategoricalColumns.forRegression, binarizeCategoricalColumns.pairwise and binarizeCategoricalColumns.forPlots that provide binarization for some of the most common tasks that I encounter in my analyses.

Functions binarizeCategoricalColumns.forPlots and binarizeCategoricalColumns.forRegression create indicators for level vs. all others. binarizeCategoricalColumns.forRegression drops the indicator for the first level since it is not linearly independent of the rest (when the intercept column is included as well, as it normally is in regression analysis). Apart from the intercept term, binarizeCategoricalColumns.forRegression does essentially the same as model.matrix with a simple 1-variable design. The binarizeCategoricalColumns.forRegression function allows the user to specify the order of the levels, which requires an extra step when using model.matrix(). As its name suggests, binarizeCategoricalColumns.pairwise creates pairwise binary indicators, that is, indicators for contrasts between pairs of levels.
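For comparison, here is what the model.matrix route looks like in base R for the same 3-level variable used above. This sketch only shows the model.matrix side, including the extra step needed to change the level order; it does not call the WGCNA wrappers, and the level order chosen below is arbitrary.

```r
# The same 3-level variable as above, stored as a factor
x = factor(rep(c("A", "B", "C"), each = 3));

# 1-variable design: intercept plus indicators for all levels but the first
mm = model.matrix(~ x);
colnames(mm);

# Apart from the intercept, the columns are level vs. all others indicators
# with the first level (A) dropped
B.indicator = unname(mm[, "xB"]);
C.indicator = unname(mm[, "xC"]);

# The extra step needed to control the level order: re-create the factor
# with explicit levels, here making C the reference (first) level
x2 = factor(x, levels = c("C", "A", "B"));
mm2 = model.matrix(~ x2);
colnames(mm2);
```

The columns xB and xC contain the same 0/1 values as the B.vs.all and C.vs.all indicators printed earlier, which is the sense in which the two approaches agree for a simple 1-variable design.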