Filtering and collapsing data

I wrote recently about the “blockwise” approach that allows the WGCNA package to analyze large data with modest computational resources. This is all nice and well, but it often makes sense to reduce the number of variables in the data set before even starting the analysis. The simplest reduction is to filter out the variables least likely to be informative. Another option, available when multiple variables represent the same functional unit, is to collapse them such that each unit is represented by a single variable.

Filtering out uninformative variables

For microarray and RNA-seq data, I tend to filter out low-expressed variables (probes or genes). The rationale is that in most tissues, one can expect at most 10,000 to 15,000 genes to be actually expressed; hence, several thousand of the lowest-expressed genes are likely simply measurement noise. Unfortunately, each high-throughput technology seems to have its own biases in quantifying expression, making it difficult to unambiguously identify the lowest-expressed genes.

For microarray data I usually plot a histogram of (log-transformed) expression values across the entire data set. A typical histogram from an Illumina data set is shown below.


The important observation is that starting at the lowest expression values, the frequency increases until a peak is attained at still fairly low expression values. My simple interpretation is that the measurement noise and background are responsible for most of the expression values that are lower than or approximately equal the peak. I tend to select a cutoff value near the peak or where the initial steep increase levels off and filter out those probes whose expression is below the peak in a suitable majority of the samples (the exact details depend on the experimental design and data). This approach usually leads to a reasonable number (10-15k) of retained genes.

For RNA-seq summarized to counts, I simply filter on absolute counts, since one could make the argument that genes with low observed counts (say below 5 or 10) are probably not really expressed, or that their expression cannot be reliably measured. Furthermore, low count data (which usually also contains many zeros) are not really suitable for a correlation analysis. I should say that this simple approach works fine for bulk (tissue) sequencing data but I would be very careful about applying it to single-cell sequencing where many interesting genes may have zero counts in most cells, and relatively low counts in the rest. (As an aside, just in the last few weeks several articles appeared that describe what are essentially imputation methods to deal with the many zeros in single cell RNA-seq data.)

For RNA-seq summarized to RPKM/FPKM, the approach is the same but it is clear (at least to me) what constitutes values that cannot be considered expressed and approximately continuous. People often use a thresholds of 0.1 to 1. I haven’t done enough work with RPKM or FPKM data to have a good opinion about what a suitable cutoff would be; my approach would be to set the filter cutoff such that a reasonable number of genes remains. For a simple tissue consisting of a single cell type that would be around 10k genes, while for complex tissues (such as brain) I would keep 15-20k genes.

Collapsing data to larger functional units

Collapse is often a bad thing, but not here! When the variables (for example, microarray probes) in a data set represent larger functional units (for example, genes), with multiple variables per unit, it may make sense to “collapse” the multiple measurements for each unit into a single profile. There are many possible ways of either selecting one representative variable for each larger functional unit or of combining all or some of the variables into a summary measure for each unit. Selection of representative probes is explored in the article Strategies for aggregating gene expression data: The collapseRows R function by Jeremy Miller and collaborators (link to article, link to dedicated page). For microarray data that contain abundance measures, the article recommends (and I concur) to select the probe with highest mean expression as the representative for each gene. The catch is that this simple recipe cannot be used for data that report ratios relative to a reference sample (such as two-color arrays).

Instead of selecting a single representative variable (probe), one may be tempted to average all variables belonging to a single unit (gene). The function collapseRows can do it, as can others such as function mapNmerge from package gCMAP. One has to be careful here that none of the variables contain missing data (which could lead to large biases in the mean) and that the mean is not explicitly or implicitly weighted toward variables that are measured less confidently (this is a bit technical, so feel free to skip the rest of this paragraph). For example, microarray data are often log-transformed as part of preprocessing and normalization; this results in data where variance is approximately independent or slightly decreasing with mean expression (lowest expressed genes, after log transformation, tend to have slightly higher variances than highly expressed genes). Averaging a low-expressed and high-expressed probe in such data would result in a summary expression measure that more closely reflects the low-expressed probe than the high-expressed one. This makes little sense since measurements by high-expressed probes are by all accounts more reliable than those of low-expressed ones. Hence, I recommend against using mean summarization on log-transformed data; one could use the summarization before log transformation when high-expressed probes tend to have higher variance than low-expressed ones, or one could use a weighted mean with weights that are higher for high-expressed probes.

The takeaway is that whenever either filtering or collapsing make sense, they can (and should) be used to reduce the number of variables before starting an analysis. Filter, collapse, and if the data are still too large to be analyzed in a single block, use the blockwise approach.