Removal of unwanted variation based on a subset of samples, with R code

Batch effects, technical variation and other sources of unwanted or spurious variation are ever-present in big data, especially so in high-throughput molecular (gene expression, proteomic or methylation) profiling. Fortunately, multiple methods exist for removing such variation, suitable for the various situations one may encounter. When a source of unwanted variation is known and can be represented by a numeric or categorical variable (such as batch), the variation can be adjusted for using regression models. The best-known method for molecular data, ComBat, works for categorical confounders only; primarily because of this limitation, I implemented a similar, empirical Bayes-moderated linear model adjustment in the function empiricalBayesLM in WGCNA. This function works for both categorical and continuous confounders, but adjusts only the mean, while ComBat also adjusts the variance.

Whether working with categorical or numeric confounders, it is sometimes desirable to adjust the data such that only a subset of samples are used for the mean and possibly variance equalization. As an example, one may wish to merge two data sets, generated by different authors, in different labs and at different times, each containing control and disease samples, with a different disease in each set. Since molecular profiles of the two diseases may be different, it does not make sense to include these samples in the mean and variance adjustment; the adjustment coefficients (shifts and scalings) should be based only on the control samples but should be applied to all samples. For simplicity, assume that the only (strong) source of technical variation is the batch effect between the two sets.

In the WGCNA function empiricalBayesLM, the user can use the argument fitToSamples to specify which samples should be used in determining the adjustment coefficients; in the example of two sets, one would supply the indices of the control samples from both sets. The disease samples do not enter the calculation of coefficients but are then adjusted together with the control samples.

Although ComBat does not let the user specify a subset of samples to which the models should be fit, one can use the argument ‘mod’ (the model matrix of outcome of interest and other covariates) to achieve substantially the same effect. One could also use this approach with empiricalBayesLM, but there is a price to pay: including the disease variables in the model means their effect is first regressed out (together with the batch effect), then added back to the data (without the batch effect, of course). This can sometimes introduce artifacts and inflate significance in downstream analyses, as pointed out in a 2016 article by Nygaard and collaborators.

I will illustrate the two different approaches on simulated data. I will use a very simple simulation with normally distributed variables; the aim is to illustrate the batch correction, not to simulate realistic (say gene expression) data. The first chunk of code sets up the R session and simulates the two “data sets”. Copy and paste the code below into an R session (make sure you have packages sva and WGCNA installed) to try it yourself.

library(sva)
library(WGCNA)
options(stringsAsFactors = FALSE);

nGenes = 1000;
nSamples1 = 8;
nSamples2 = 12;
disEffect = 0.5;
batchEffect = 0.4;
set.seed(2);
# Simulate first data set, genes in columns
data1 = matrix(rnorm(nSamples1 * nGenes), nSamples1, nGenes)
annotation1 = data.frame(
      status = sample(c("Ctrl", "Disease1"), nSamples1, replace = TRUE));
# Add a global effect of disease
dSamples1 = annotation1$status=="Disease1"
data1[ dSamples1, ] = data1[ dSamples1, ] + 
      disEffect * matrix(rnorm(nGenes), sum(dSamples1), nGenes, byrow = TRUE)
# Simulate second data set
data2 = matrix(rnorm(nSamples2 * nGenes), nSamples2, nGenes)
annotation2 = data.frame(
     status = sample(c("Ctrl", "Disease2"), nSamples2, replace = TRUE));
# Add a global effect of disease
dSamples2 = annotation2$status=="Disease2";
data2[ dSamples2, ] = data2[ dSamples2, ] + 
      disEffect * matrix(rnorm(nGenes), sum(dSamples2), nGenes, byrow = TRUE)

# Add a batch effect to second data set: shift each gene by a random amount
data2 = data2 + batchEffect * matrix(rnorm(nGenes), nSamples2, nGenes, byrow = TRUE)

# Prepare a function to plot principal components since we will use it a few times
plotPCA = function(data, annotation, ...)
{
  svd1 = svd(data, nu = 2, nv = 0);
  status = annotation$status;
  ctrlSamples = status=="Ctrl"
  status[ctrlSamples] = paste0(status[ctrlSamples], annotation$batch[ctrlSamples])
  layout(matrix(c(1:2), 1, 2), widths = c(0.3, 1))
  par(mar = c(3.2, 0, 0, 0));
  plot(c(0, 1), type = "n", axes = FALSE, 
       xlab = "", ylab = "");
  legend("bottomright", legend = sort(unique(status)), 
         pch = 20 + c(1:length(unique(status))), 
         pt.bg = labels2colors(1:length(unique(status))), 
         pt.cex = 2)
  par(mar = c(3.2, 3.2, 2, 1))
  par(mgp = c(2, 0.7, 0))
  plot(svd1$u[, 1], svd1$u[, 2], 
       xlab = "PC1", ylab = "PC2", 
       pch = 20 + as.numeric(factor(status)), 
       cex = 2.5, bg = labels2colors(status), cex.main = 1.0, ...)

}

The next chunk is where the common analysis starts. The two sets are merged and I plot the first two principal components of the merged data.

# Merge the two data sets
data = rbind(data1, data2)
# Include a batch variable in the combined sample annotation
annotation = data.frame(
    rbind(annotation1, annotation2), 
    batch = c(rep(1, nSamples1), rep(2, nSamples2)))
# Plot PCs of the merged data
plotPCA(data, annotation, 
    main = "Merged data before batch correction")

The samples separate cleanly into 4 groups; the difference between the two control groups is the batch effect.

# Correct for the batch effect using empiricalBayesLM from WGCNA
data.eblm = empiricalBayesLM(
   data, 
   removedCovariates = annotation$batch, 
   fitToSamples = annotation$status=="Ctrl")$adjustedData;
# See how well the correction worked
plotPCA(data.eblm, annotation, 
        main = "Corrected by regression on control samples")

A residual batch effect is still apparent and can be removed using ordinary linear regression rather than the empirical Bayes-moderated one, at the expense of possibly introducing some over-fitting. Since empiricalBayesLM also returns data corrected using plain linear models, the code is very similar. The component adjustedData of the output of empiricalBayesLM is simply replaced by adjustedData.OLS, OLS standing for ordinary least squares:

data.lm = empiricalBayesLM(data, removedCovariates = annotation$batch, 
   fitToSamples = annotation$status=="Ctrl")$adjustedData.OLS;
# See how well the correction worked
plotPCA(data.lm, annotation, 
        main = "Unmoderated regression on control samples")

Using plain linear models indeed removed the batch effect completely.

It’s ComBat’s turn now. I first show how a naive application of ComBat to correct the batch effect, without regard for disease status, leads to an incorrect batch correction.

data.ComBat.naive = t(ComBat(t(data), batch = annotation$batch))
plotPCA(data.ComBat.naive, annotation, 
        main = "Naive application of ComBat")

A major separation remains between the two groups of control samples, certainly not what one would like or expect from a successful batch correction. The correct approach is to include covariates of interest that encode the disease status in two separate indicator variables, one for each disease.

data.ComBat = t(ComBat(t(data), batch = annotation$batch, 
                       mod = model.matrix(~status, data = annotation)))
plotPCA(data.ComBat, annotation, 
        main = "Correct application of ComBat")

The result looks very much like the correction using empiricalBayesLM above, including the residual batch effect, but the results are not exactly the same – the approaches are in some ways equivalent but do not yield identical results. The upshot is that one can use both empiricalBayesLM and ComBat to achieve a similar effect. The advantage of empiricalBayesLM is that it is more flexible: one can specify the samples to which the linear models should be fit rather than having to include extra covariates of interest. Since empiricalBayesLM avoids the two-step “regress out, add back in” procedure and its potential instabilities, one can also use a plain linear model adjustment to remove the batch effect completely. On the flip side, if the number of suitable samples is low (for example, only a few control and many disease samples), it may be better to include all samples and the additional covariates in the calculation, perhaps introducing a bit of systematic bias in exchange for less (quasi-)random variation.

Working with categorical variables

Most anyone working with any kind of data will have no trouble with binary outcomes (for example, case vs. control) and with relating them to continuous variables such as gene expression profiles. Indeed, the Student t-test and simple linear regression are some of the first topics encountered in data analysis. Categorical outcomes that encode more than two possible groups or values can be more of a challenge: although there are statistical tests such as the Kruskal-Wallis test that assess whether a continuous variable has the same mean across multiple groups, the analyst will usually want to know for which pairs of groups the means differ. For example, if the outcome can be control or one of 3 diseases, one would like to know not only that a gene's expression differs between some of the 4 outcomes, but also which ones.

To make the following more concrete, imagine the outcome (call it O) is a categorical variable with M different unique values. Each unique value is also called a “level”. We’ll label the levels L1, L2, …, LM. The levels could, for example, be “Control” and 3 different types of disease (M=4). One might like to know which genes (or WGCNA modules) differ in their expression between the various diseases and controls, between pairs of diseases, or which expression changes are unique to each disease.

The easiest approach is to create binary indicators for the various comparisons (I will call them contrasts). There are two types of such indicators. The first, call it a pairwise indicator, represents the contrast of two levels. A pairwise indicator for levels, say, L1 and L2 would be called L2 vs. L1 and equals 0 for all samples with outcome L1, 1 for all samples with outcome L2, and NA (missing value) otherwise. The values 0 and 1 are somewhat arbitrary (one could just as well use, say, 1 and 2) but it is advantageous to choose them so that their difference is 1. Note the naming convention: the level with the larger value is mentioned first in the name. This ensures that when the coefficient (usually interpreted as a log-fold change) in a regression model of a variable (gene expression or module eigengene) on the indicator L2 vs. L1 is positive, the variable is higher in L2 than in L1 samples.

The second type of binary indicators contrasts a level vs. all others. The indicator for level L1 could be called L1 vs. all others and equals 1 for all samples with outcome L1 and 0 otherwise. With M levels one can create M indicators.

WGCNA implementation

As of September 2018, the WGCNA R package can binarize a categorical variable using the function binarizeCategoricalVariable. The function takes a whole lot of arguments that let the user specify precisely how and which indicators should be built and how to call them. The R code below illustrates its use. (Copy and paste the code into an R session to try it out!)

library(WGCNA)
# Define a categorical variable with 3 levels
x = rep(c("A", "B", "C"), each = 3);
# Binarize it into pairwise indicators
out = binarizeCategoricalVariable(x,
         includePairwise = TRUE,
         includeLevelVsAll = FALSE);
# Print the variable and the indicators
data.frame(x, out);

The code above prints a data frame consisting of the original variable x and the 3 pairwise binary indicators based on it:

  x B.vs.A C.vs.A C.vs.B
1 A      0      0     NA
2 A      0      0     NA
3 A      0      0     NA
4 B      1     NA      0
5 B      1     NA      0
6 B      1     NA      0
7 C     NA      1      1
8 C     NA      1      1
9 C     NA      1      1

The next code chunk illustrates creation of indicators for each level vs. all others:

out = binarizeCategoricalVariable(x,
         includePairwise = FALSE,
         includeLevelVsAll = TRUE);
# Print the variable and the indicators
data.frame(x, out);

And here’s what R outputs: a data frame with the variable x and the (again 3) level vs. all others indicators.

  x A.vs.all B.vs.all C.vs.all
1 A        1        0        0
2 A        1        0        0
3 A        1        0        0
4 B        0        1        0
5 B        0        1        0
6 B        0        1        0
7 C        0        0        1
8 C        0        0        1
9 C        0        0        1

In my usual workflows, sample characteristics are typically contained in data frames and I often need to binarize several of them at once. Here I use the function binarizeCategoricalColumns, which applies the binarization to selected columns of a data frame. Finally, there are convenience wrappers named binarizeCategoricalColumns.forRegression, binarizeCategoricalColumns.pairwise and binarizeCategoricalColumns.forPlots that provide binarization for some of the most common tasks that I encounter in my analyses.

Functions binarizeCategoricalColumns.forPlots and binarizeCategoricalColumns.forRegression create indicators for level vs. all others. binarizeCategoricalColumns.forRegression drops the indicator for the first level since it is not linearly independent of the rest (when the intercept column is included as well, as it normally is in regression analysis). Apart from the intercept term, binarizeCategoricalColumns.forRegression does essentially the same as model.matrix with a simple 1-variable design, but it also allows the user to specify the order of the levels, which requires an extra step when using model.matrix(). As its name suggests, binarizeCategoricalColumns.pairwise creates binary indicators for pairwise level contrasts.
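
To give a flavor of these wrappers, here is a minimal sketch applied to a small sample annotation data frame. The data frame and its column are made up for the example; I assume the wrappers accept the data frame as their first argument and convert the character column with their default settings (see the help files for the full set of arguments).

sampleInfo = data.frame(status = rep(c("Ctrl", "Disease1", "Disease2"), each = 3));
# Indicators suitable for regression: the indicator for the first level ("Ctrl") is dropped
binarizeCategoricalColumns.forRegression(sampleInfo);
# Pairwise indicators for all pairs of levels
binarizeCategoricalColumns.pairwise(sampleInfo);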

Functional enrichment analysis via R package anRichment

At some point in most any analysis of high-throughput data one wants to study enrichment of a resulting set (or sets) of genes in predefined reference gene sets. Although there are many tools out there that let the user evaluate enrichment in standard reference sets such as GO and KEGG, there are relatively few that allow the user to build, store and reuse custom collections. For these and other reasons I put together an R package called anRichment that allows one to run standard enrichment calculations against the usual collections of reference gene sets such as GO, KEGG and others, as well as against custom gene lists such as the ones available through the userListEnrichment function in the WGCNA package.

It all started around 2009 or so, when the standard way of studying functional enrichment was to upload relevant gene lists one by one to a web server such as DAVID, and download the resulting tables. This works fine for one analysis with just a few gene lists but is not really suitable for automating analyses or even just trying out several different sets of parameters for WGCNA (uploading 20-30 modules after every parameter change gets tiresome real fast). After looking through then-available Bioconductor packages, I did not find anything that suited my needs, so I wrote my own GO enrichment function GOenrichmentAnalysis, still available but now deprecated in WGCNA. Around the same time, neuroscientist Jeremy A. Miller collected multiple brain-related reference gene sets from the published literature and wrote the function userListEnrichment to study enrichment of input gene sets in his collection of brain gene sets. (In case you’re wondering, the function was published in this article.)

Although running automatable R code was a big improvement on uploading modules to DAVID every time something in an analysis changed, it eventually became clear to me that we need a unified enrichment calculation function for both types of reference gene sets; having two separate functions is inconvenient and at the very least makes it necessary to re-run multiple testing correction after the results have been combined.

Another area that I didn’t see addressed in the software packages available then was the ability to define groups of individual gene sets based on their origin (say tissue, technology, etc.), interpretation, or any other characteristic that may be relevant. One could then restrict a large collection of, say, all brain sets to just cortex-related sets, or just disease-specific sets, and so on. Since a gene set could belong to many groups, one could also think of the groups as tags.

Around mid-2014, I put together the first versions of an R package for annotation and enrichment calculations, and eventually called it anRichment (for, well, annotation and enrichment, with a capital R to emphasize the R language in which it is written). Over time the package, as well as the number of gene sets in it, grew, and the newest and greatest version is actually split into two packages: anRichmentMethods for the functions, and anRichment itself for data and accessor functions. It is best to think of the two essentially as one package, split only so one would not have to download and reinstall several tens of megabytes worth of data every time a function changes or gets added.

What does anRichment do?

The packages aim to do a few things I found useful in my own work:

  • Collect interesting gene sets in an organized, tagged collection for relatively easy retrieval and manipulation. This includes functions for creating custom gene sets and annotating them with tags.
  • Combine standard databases of functional gene sets such as GO, KEGG and others with custom collections of gene sets in a unified structure allowing equal treatment and use.
  • Calculate enrichment of query sets in reference gene sets and output all relevant statistics in a convenient format. At present, enrichment is evaluated using the Fisher exact test only. (A small example follows after this list.)
  • Provide supporting functions for the multitude of smaller tasks that often crop up when collecting gene sets or calculating enrichment.
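
Here is a minimal sketch of what such an enrichment calculation can look like, based on my recollection of the introductory tutorial; the variables moduleLabels and entrez are hypothetical (module assignments and Entrez identifiers for the genes in the data), and the exact function and argument names should be double-checked against the tutorial and help files.

library(anRichment)
# Build the GO reference collection (here assumed human)
GOcollection = buildGOcollection(organism = "human")
# moduleLabels: module assignment for each gene; entrez: corresponding Entrez IDs
GOenrichment = enrichmentAnalysis(
   classLabels = moduleLabels,
   identifiers = entrez,
   refCollection = GOcollection)
# The main result is a table of enrichment statistics and p-values
head(GOenrichment$enrichmentTable)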

To help users get started, I wrote an introductory tutorial. It contains a simple example calculation of enrichment of WGCNA modules in GO, sketches out some of the more advanced capabilities of the package and provides information to hackers who would like to tinker with the existing code. Worth checking out! (Says the author :))

Collections available in anRichment

The package either stores or provides access to multiple collections of reference gene sets:

  • GO, KEGG, NCBI BioSystems pathways: The starting point of most enrichment calculations. GO sets are accessed through Bioconductor annotation packages while KEGG and other components of the NCBI BioSystems pathway database are stored internally.
  • Internal collection: The original collection of gene sets collected by Jeremy A. Miller while he was a PhD student at UCLA. It contains brain and blood-related gene sets from various published articles.
  • HD Signatures Database (HDSigDB): A collection of gene sets directly or indirectly related to Huntington’s disease (HD). This collection is maintained by Rancho Biosciences under contract from CHDI, Inc. The HDinHD portal contains detailed descriptions of the gene sets. HDinHD requires free registration to access the data.
  • Miller AIBS collection: More gene sets collected by Jeremy A. Miller up to about 2014. Contains brain development-related gene sets, transcription factor targets and others.
  • HD Target DB: HD-related gene sets collected originally by Michael Palazzolo and Jim Wang for CHDI. Contains functional sets compiled from literature as well as textbooks, gene sets from HD perturbation studies, protein-protein interactor sets and others.
  • Neurogenomic sets collected by X. William Yang and members of his lab: Another collection of gene sets curated from published articles that people in Yang lab found useful in their research.
  • Positional gene sets: Each gene set contains genes within a certain window around a given genomic position. These sets are generated dynamically from Bioconductor annotation packages.
  • Molecular Signatures Database: anRichment provides a function that converts the Molecular Signatures Database (MSigDB) in XML file format into an anRichment collection. Users wishing to use MSigDB need to obtain the XML file from Broad Institute.

But wait, there’s more! Several add-on packages provide further collections, including WGCNA modules from my own analyses of HD data and brain-related gene sets culled from the literature by our friends at Verge Genomics.

The reader has by now surely noticed that most custom collections in anRichment focus on neuroscience in general and Huntington’s disease in particular. Well, that mirrors the focus of my own work and I suppose it will make anRichment, as it stands now, most useful to the neuroscience community. I surely hope that people in other fields who have their own favorite collection of literature gene sets find anRichment useful and perhaps share their collection or collections with the wider world.

Filtering and collapsing data

I wrote recently about the “blockwise” approach that allows the WGCNA package to analyze large data with modest computational resources. This is all nice and well, but it often makes sense to reduce the number of variables in the data set before even starting the analysis. The simplest reduction is to filter out the variables least likely to be informative. Another option, available when multiple variables represent the same functional unit, is to collapse them such that each unit is represented by a single variable.

Filtering out uninformative variables

For microarray and RNA-seq data, I tend to filter out low-expressed variables (probes or genes). The rationale is that in most tissues, one can expect at most 10,000 to 15,000 genes to be actually expressed; hence, several thousand of the lowest-expressed genes are likely simply measurement noise. Unfortunately, each high-throughput technology seems to have its own biases in quantifying expression, making it difficult to unambiguously identify the lowest-expressed genes.

For microarray data I usually plot a histogram of (log-transformed) expression values across the entire data set. A typical histogram from an Illumina data set is shown below.

[Figure: histogram of log-transformed expression values from a typical Illumina microarray data set]

The important observation is that, starting at the lowest expression values, the frequency increases until a peak is attained at still fairly low expression values. My simple interpretation is that measurement noise and background are responsible for most of the expression values that are lower than or approximately equal to the peak. I tend to select a cutoff value near the peak or where the initial steep increase levels off, and filter out those probes whose expression is below the cutoff in a suitable majority of the samples (the exact details depend on the experimental design and data). This approach usually leads to a reasonable number (10-15k) of retained genes.
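
As an illustration, here is a minimal sketch of such a filter; the matrix expr (samples in rows, probes in columns), the cutoff and the required fraction of samples are all hypothetical and need to be chosen by inspecting the actual data.

# expr: samples x probes matrix of log-transformed expression values (hypothetical)
hist(as.vector(expr), breaks = 200, xlab = "log2 expression")
cutoff = 5           # example value, chosen near the low-expression peak
minFraction = 0.25   # keep probes above the cutoff in at least 25% of samples
keep = colSums(expr > cutoff, na.rm = TRUE) >= minFraction * nrow(expr)
expr.filtered = expr[, keep]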

For RNA-seq summarized to counts, I simply filter on absolute counts, since one could make the argument that genes with low observed counts (say below 5 or 10) are probably not really expressed, or that their expression cannot be reliably measured. Furthermore, low-count data (which usually also contain many zeros) are not really suitable for a correlation analysis. I should say that this simple approach works fine for bulk (tissue) sequencing data but I would be very careful about applying it to single-cell sequencing, where many interesting genes may have zero counts in most cells and relatively low counts in the rest. (As an aside, just in the last few weeks several articles appeared that describe what are essentially imputation methods to deal with the many zeros in single-cell RNA-seq data.)
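
Again a minimal sketch with hypothetical names: counts is a genes x samples matrix of raw counts and the thresholds are just example values.

# counts: genes x samples matrix of raw counts (hypothetical)
minCount = 10        # counts below this are treated as not reliably measured
minSamples = 0.25 * ncol(counts)
keep = rowSums(counts >= minCount) >= minSamples
counts.filtered = counts[keep, ]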

For RNA-seq summarized to RPKM/FPKM, the approach is the same, but it is less clear (at least to me) which values can be considered expressed and approximately continuous. People often use thresholds of 0.1 to 1. I haven’t done enough work with RPKM or FPKM data to have a good opinion about what a suitable cutoff would be; my approach would be to set the filter cutoff such that a reasonable number of genes remains. For a simple tissue consisting of a single cell type that would be around 10k genes, while for complex tissues (such as brain) I would keep 15-20k genes.

Collapsing data to larger functional units

Collapse is often a bad thing, but not here! When the variables (for example, microarray probes) in a data set represent larger functional units (for example, genes), with multiple variables per unit, it may make sense to “collapse” the multiple measurements for each unit into a single profile. There are many possible ways of either selecting one representative variable for each larger functional unit or of combining all or some of the variables into a summary measure for each unit. Selection of representative probes is explored in the article Strategies for aggregating gene expression data: The collapseRows R function by Jeremy Miller and collaborators (link to article, link to dedicated page). For microarray data that contain abundance measures, the article recommends (and I concur) selecting the probe with the highest mean expression as the representative for each gene. The catch is that this simple recipe cannot be used for data that report ratios relative to a reference sample (such as two-color arrays).
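
Here is a minimal sketch of the highest-mean-expression collapse using collapseRows; exprByProbe (probes in rows, samples in columns), probeID and geneID are hypothetical names for the expression matrix and the probe and gene identifiers.

library(WGCNA)
# exprByProbe: probes x samples matrix; geneID, probeID: gene and probe identifiers
collapsed = collapseRows(
   datET = exprByProbe,
   rowGroup = geneID,
   rowID = probeID,
   method = "MaxMean")   # select the probe with the highest mean expression per gene
exprByGene = collapsed$datETcollapsed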

Instead of selecting a single representative variable (probe), one may be tempted to average all variables belonging to a single unit (gene). The function collapseRows can do it, as can others such as the function mapNmerge from package gCMAP. One has to be careful here that none of the variables contain missing data (which could lead to large biases in the mean) and that the mean is not explicitly or implicitly weighted toward variables that are measured less confidently (this is a bit technical, so feel free to skip the rest of this paragraph). For example, microarray data are often log-transformed as part of preprocessing and normalization; this results in data where the variance is approximately independent of, or slightly decreasing with, mean expression (the lowest-expressed genes, after log transformation, tend to have slightly higher variances than highly expressed genes). Averaging a low-expressed and a high-expressed probe in such data would result in a summary expression measure that more closely reflects the low-expressed probe than the high-expressed one. This makes little sense since measurements by high-expressed probes are by all accounts more reliable than those of low-expressed ones. Hence, I recommend against using mean summarization on log-transformed data; one could apply the mean summarization before the log transformation, when high-expressed probes tend to have higher variance than low-expressed ones, or one could use a weighted mean with weights that are higher for high-expressed probes.
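
To make the last suggestion concrete, here is a minimal sketch of one possible weighted mean collapse; the choice of weights proportional to mean expression is just one option, the names exprByProbe and geneID are again hypothetical, and the calculation assumes no missing values.

# exprByProbe: probes x samples matrix with no missing values (hypothetical)
weights = rowMeans(exprByProbe)                      # high-expressed probes get higher weight
num = rowsum(exprByProbe * weights, group = geneID)  # weighted sums per gene
den = as.vector(rowsum(weights, group = geneID))     # sums of weights per gene
exprByGene.weighted = num / den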

The takeaway is that whenever either filtering or collapsing make sense, they can (and should) be used to reduce the number of variables before starting an analysis. Filter, collapse, and if the data are still too large to be analyzed in a single block, use the blockwise approach.

When can correlation network analysis be useful and when are you better off not using it

Weighted correlation analysis in general and WGCNA in particular can be applied to many problems and data sets, but certainly not to all. To set the terminology straight, recall that, in a correlation network, each node represents a variable (feature), and links represent correlations, possibly transformed, among the variables. Although one could construct the network just for the sake of the network, more commonly the networks are used to study which variables (and which groups or “modules” of variables) are likely to be important for a property of the system that is represented by the network.

Correlation network analysis makes a few important assumptions about the system and its properties one wants to study:

  • Collective behaviour of multiple nodes in the network should be relevant for the property one wants to study. This is in some ways the most important assumption. Network analysis is about studying the collective behaviour and interplay of nodes; if these are irrelevant for the property one is interested in, network analysis will be of little help.
  • Correlations should reflect functional relationships, at least to some degree. Correlation networks are based on correlations of the network nodes (variables); it is assumed that these correlations at least partially reflect functional relationships. For example, in gene co-expression and co-methylation networks, functionally related genes are often but not always correlated. However, correlation can also be caused by technical artifacts such as batch effects or poor normalization. For certain data types, correlations reflect inherent relationships but those relationships are not interesting — for example, genotype SNP markers are usually strongly correlated with other nearby SNPs because of linkage disequilibrium, not because of functional relationships.
  • Functional relationships reflected in the correlations should be relevant for the property one wants to study. This may seem a bit redundant, but bear with me. Since correlation networks are (usually) constructed from data without regard to a particular property (technically, they are built in an unsupervised manner), the correlations will reflect the largest drivers of variability in the network node variables. If these drivers are unrelated to the property, the network analysis may find beautiful, functionally coherent and meaningful network modules that are nevertheless entirely useless for the studied property. A somewhat contrived example would be a gene expression study of a disease across several different tissues (say liver, muscle and fat tissue). Were one to combine the data from the different tissues into a single data set and run a network analysis on it, the modules would mainly correspond to the different tissues or perhaps major cell types. This makes biological sense and may even allow one to classify previously unstudied genes in terms of their expression across tissues, but would likely provide no information about the disease.
  • Calculating correlation on the data should make sense. (I know, sounds obvious.) Pearson correlation and other related measures (e.g., robust modifications) work well on data that can at least approximately be thought of as continuous, with a symmetric distribution that is not too heavy-tailed compared to the normal. Examples of data on which they do not work well are binary variables, low counts (when most counts are at most 3 or so), and especially sparse counts (when most counts are 0).

In addition, as with most other data analysis methods, one needs a reasonable amount of reasonably clean data. Network analysis and WGCNA are no magic wands; if the data contain a lot of technical artifacts or noise, the results will not be useful*. Because network analysis is unsupervised, it is important that known and inferred sources of unwanted variation be adjusted for. Although effects of outliers can be minimized through the use of robust correlation calculations, it is better to remove samples that are clear outliers.

How many samples are a “reasonable amount”? If possible, at least 30 samples, ideally at least 50, assuming the relevant signal in the data is strong enough that this number of samples can be expected to yield significant findings. At the low end, I would not spend time analyzing a data set with fewer than 10 samples in WGCNA. Anything below 15 samples is also not likely to yield deep insights, although, depending on the experimental design, WGCNA may yield more robust and interpretable results than an analysis of differential expression of individual genes.

*The GIGO principle applies: Garbage In, Garbage Out.

WGCNA resources on the web

This post collects a few links to WGCNA-related material posted elsewhere on the web.

First and foremost, the WGCNA page maintained by me (PL) is the place to go for WGCNA downloads, the original set of tutorials and an FAQ.

Steve Horvath wrote a comprehensive book on weighted network analysis called, appropriately, Weighted Network Analysis: Applications in Genomics and Systems Biology. It is available as an e-book from Springer as well as a regular paper book from Amazon. If you are affiliated with a university, your library may have an electronic subscription to Springer and you may be able to download the book without having to pay.

Several recorded lectures from various network analysis courses are available on YouTube, including a full playlist of videos from the 2015 Short Course on Network Analysis.

The co-expression network analysis home page maintained by Steve Horvath and his lab contains links to multiple scientific articles that introduced aspects of WGCNA and used it in a variety of applications.

“Blockwise” network analysis of large data

A straightforward weighted correlation network analysis of large data (tens of thousands of nodes or more) is quite memory hungry. Because the analysis uses a correlation or similarity matrix of all nodes, for n network nodes the memory requirement scales as n². In R, one has to multiply that by 8 bytes for each (double precision) number, and then again by a factor of 2-3 to allow for copying operations the R interpreter needs to do while executing operations on the matrices. When all is said and done, analyzing a set of 20,000 variables or nodes (for example, genome-wide data summarized to genes) requires between 8 and 16 GB of memory; 40,000 nodes (for example, expression data summarized to transcripts) would increase the requirement to 32-64 GB. My personal experience is that 40k transcripts pushes a 32 GB memory space very hard. A full network analysis of an Illumina 450k methylation data set with its nearly 500,000 probes would theoretically require some 7 TB of memory.
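
For a quick back-of-the-envelope check, the arithmetic behind the estimate can be reproduced directly in R (plain arithmetic, not a WGCNA call):

n = 20000                   # number of network nodes
matrixGB = n^2 * 8 / 2^30   # one double-precision n x n matrix, in gigabytes
matrixGB * c(2, 3)          # with the 2-3x copying overhead: roughly 6 to 9 GB for
                            # the matrices alone; total usage will be somewhat higher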

Are large data and small RAM a no-go for WGCNA? Apart from persuading the account manager to purchase a larger computer or access to one, there are at least two options for tackling large data with WGCNA. The first option is to reduce the number of nodes (variables) in the data, by filtering out uninformative variables or combining (“collapsing”) multiple variables into one. This approach is often effective for gene expression data. But sometimes one simply cannot reduce the number of variables sufficiently without losing too much information. What then? Despair not.

WGCNA implements a trick that allows an approximate analysis of large data sets on computers with modest memory. The trick consists of splitting (“pre-clustering”) the network nodes into blocks such that nodes in different blocks are only weakly correlated. Since the between-block correlations are assumed to be negligible, the network analysis is carried out in each block separately. After modules are identified in all blocks, their eigengenes are calculated and modules with highly correlated eigengenes are merged (possibly across blocks).

This block-by-block (“blockwise”) analysis limits memory requirements to the (squared) size of the largest block. When the largest block is say 1/10th of the number of nodes, the memory requirement goes down by a factor of 100 – that outlandish-looking network analysis of an Illumina 450k data set suddenly becomes doable on a reasonably modern server with say 96 GB of RAM. A beneficial “side effect” is that network construction, particularly TOM calculation, also runs much faster. TOM calculation in a block of size n_b takes on the order of n_b³ operations; if the block sizes are about 1/10th of the number of all nodes, a block-wise TOM calculation will require 10 × (n/10)³ = n³/100 operations, or only 1/100 of the time the TOM calculation for data in one large block would take. On the flip side, the pre-clustering also takes some time but is usually faster than a full network analysis would be if it were feasible.

Tips for practical use

The pre-clustering for individual data sets is implemented in function projectiveKMeans. Function consensusProjectiveKMeans handles consensus pre-clustering of multiple data sets. These two functions return a vector of block labels for each variable (node) in the input data; nodes with the same label should be put into one block in subsequent network analyses. Many users will want to use the “one-stop shop” blockwise network analysis functions blockwiseModules and blockwiseConsensusModules (for consensus network analysis). The WGCNA tutorials show the most common use of these two functions.

I emphasize that the blockwise analysis creates an approximation to the network that would result from a single-block analysis. The approximation is often very good but the modules are not quite the same. If possible, I recommend running the analysis in a single block; if not, use the largest blocks your computer can handle. Block size in the “blockwise” family of functions is controlled by the argument maxBlockSize; for historical reasons, backward compatibility, and reproducibility, the default value is fixed at a relatively low (by today’s standards) value of 5000. The value should be raised as high as possible: 16 GB of memory should be able to handle up to about 24,000 nodes; 32 GB should be enough (perhaps barely so) for 40,000, and so on.
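
For orientation, here is a minimal sketch of a blockwise analysis call; the expression matrix datExpr (samples in rows, genes in columns) and the soft-thresholding power are assumptions that have to be chosen for the data at hand, and the remaining argument values are just examples.

library(WGCNA)
# datExpr: samples x genes expression matrix (hypothetical); power chosen beforehand
net = blockwiseModules(
   datExpr,
   power = 6,
   maxBlockSize = 20000,   # raise as high as the available memory allows
   minModuleSize = 30,
   mergeCutHeight = 0.25,
   numericLabels = TRUE,
   verbose = 3)
moduleLabels = net$colors        # module assignment for each gene
geneTrees = net$dendrograms      # one gene dendrogram per block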

Since most analysis steps after module identification use only the module labels but not the full TOM matrix, it does not matter whether the modules were identified in a single block or in multiple blocks. There are a few exceptions though. In standard WGCNA analysis, it is fairly common to plot a gene clustering tree (dendrogram) with module colors and possibly other gene-level information below the dendrogram. Users also sometimes want to plot a heatmap plot of the TOM matrix. These plots can only be created separately for each block; there is no meaningful way to create a single gene dendrogram or a TOM heatmap plot that combines information from multiple blocks.