Functional enrichment analysis via R package anRichment

At some point in almost any analysis of high-throughput data one wants to study enrichment of a resulting set (or sets) of genes in predefined reference gene sets. Although there are many tools out there that let the user evaluate enrichment in standard reference sets such as GO and KEGG, there are relatively few that allow the user to build, store and reuse custom collections. For these and other reasons I put together an R package called anRichment that allows one to run standard enrichment calculations against the usual collections of reference gene sets such as GO, KEGG and others, as well as against custom gene lists such as those available through the userListEnrichment function in the WGCNA package.

It all started around 2009 or so, when the standard way of studying functional enrichment was to upload relevant gene lists one by one to a web server such as DAVID, and download the resulting tables. This works fine for one analysis with just a few gene lists but is not really suitable for automating analyses or even just for trying out several different sets of parameters for WGCNA (uploading 20-30 modules after every parameter change gets tiresome real fast). After looking through the Bioconductor packages available at the time, I did not find anything that suited my needs, so I wrote my own GO enrichment function GOenrichmentAnalysis, still available but now deprecated in WGCNA. Around the same time, neuroscientist Jeremy A. Miller collected multiple brain-related reference gene sets from published literature and wrote the function userListEnrichment to study enrichment of input gene sets in his collection of brain gene sets. (In case you’re wondering, the function was published in this article.)

Although running automatable R code was a big improvement on uploading modules to DAVID every time something in an analysis changed, it eventually became clear to me that we need a unified enrichment calculation function for both types of reference gene sets; having two separate functions is inconvenient and at the very least makes it necessary to re-run multiple testing correction after the results have been combined.
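To see why the correction has to be redone on the combined results, here is a minimal sketch in Python (not R, and not code from WGCNA or anRichment). It assumes a Benjamini-Hochberg correction, which is just one common choice, and the p-values are made up for illustration.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up p-values from two separate enrichment runs (hypothetical numbers).
go_p = [0.01, 0.04]       # e.g. results against GO
custom_p = [0.03, 0.20]   # e.g. results against a custom collection

separate = bh_adjust(go_p) + bh_adjust(custom_p)
joint = bh_adjust(go_p + custom_p)

print("corrected separately:", [round(p, 4) for p in separate])
print("corrected jointly:   ", [round(p, 4) for p in joint])
```

With these numbers, two results pass a 0.05 threshold when each run is corrected on its own, but only one does once all four tests are corrected together. A single enrichment function that sees every reference set at once avoids that second pass.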

Another area that I didn’t see addressed in the software packages available then was the ability to define groups of individual gene sets based on their origin (say, tissue, technology, etc.), interpretation, or any other characteristics that may be relevant. One could then restrict a large collection of, say, all brain sets to just cortex-related sets, or just disease-specific sets, etc. Since a gene set could belong to many groups, one could also think of the groups as tags.
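The tagging idea can be sketched in a few lines. This is a minimal illustration in Python rather than R; it is not how anRichment stores its collections, and the class, function, gene set and gene names are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class GeneSet:
    name: str
    genes: frozenset
    # A gene set may carry any number of tags (its "groups").
    tags: frozenset = field(default_factory=frozenset)

def subset_by_tags(collection, required_tags):
    """Keep only the gene sets that carry every one of the required tags."""
    required = set(required_tags)
    return [gs for gs in collection if required <= gs.tags]

# Hypothetical collection; names and genes are placeholders.
collection = [
    GeneSet("cortex_disease_set", frozenset({"geneA", "geneB"}),
            frozenset({"brain", "cortex", "disease"})),
    GeneSet("striatum_set", frozenset({"geneC"}),
            frozenset({"brain", "striatum"})),
    GeneSet("blood_set", frozenset({"geneD", "geneE"}),
            frozenset({"blood"})),
]

cortex_sets = subset_by_tags(collection, {"brain", "cortex"})
print([gs.name for gs in cortex_sets])
```

Because membership is just set containment, the same gene set can show up in a "brain" subset, a "cortex" subset and a "disease" subset without being stored three times.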

Around mid-2014, I put together the first versions of an R package for annotation and enrichment calculations, and eventually called it anRichment (for, well, annotation and enrichment, with a capital R to emphasize the R language in which it is written). Over time the package as well as the number of gene sets in it grew, and the newest and greatest version is actually split into two packages: anRichmentMethods for the functions, and anRichment itself for data and accessor functions. It is best to think of the two essentially as one package, split only so one would not have to download and reinstall several tens of megabytes worth of data every time a function changes or gets added.

What does anRichment do?

The packages aim to do a few things I found useful in my own work:

  • Collect interesting gene sets in an organized, tagged collection for relatively easy retrieval and manipulation. This includes functions for creating custom gene sets and annotating them with tags.
  • Combine standard databases of functional gene sets such as GO, KEGG and others with custom collections of gene sets in a unified structure allowing equal treatment and use.
  • Calculate enrichment of query sets in reference gene sets and output all relevant statistics in a convenient format. At present enrichment is evaluated using the Fisher exact test only.
  • Provide supporting functions for the multitude of smaller tasks that often crop up when collecting gene sets or calculating enrichment.
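
For readers who have not seen the Fisher exact test used for enrichment, here is a minimal sketch of the calculation in Python rather than R. It assumes the usual one-sided (over-representation) form with an explicit background gene list; it is not the package's code, and anRichment's choice of background and its reported statistics may differ. All gene names are placeholders.

```python
from math import comb

def enrichment_p_value(query, reference, background):
    """Overlap size and one-sided Fisher exact p-value for over-representation.

    Genes outside the background are ignored.
    """
    background = set(background)
    query = set(query) & background
    reference = set(reference) & background
    N = len(background)          # all genes considered
    K = len(reference)           # genes in the reference set
    n = len(query)               # genes in the query set
    k = len(query & reference)   # observed overlap
    # Probability of an overlap of k or more when n genes are drawn
    # at random, without replacement, from the N background genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)

# Placeholder genes: 20 in the background, 5 in the reference set,
# and a 4-gene query that shares 3 genes with the reference set.
background = [f"g{i}" for i in range(20)]
reference = ["g0", "g1", "g2", "g3", "g4"]
query = ["g0", "g1", "g2", "g10"]

overlap, p = enrichment_p_value(query, reference, background)
print(overlap, round(p, 4))  # 3 0.032
```

In a real analysis this would be repeated for every query set against every reference set, which is why the multiple testing correction matters.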

To help users get started, I wrote an introductory tutorial. It contains a simple example calculation of enrichment of WGCNA modules in GO, sketches out some of the more advanced capabilities of the package and provides information to hackers who would like to tinker with the existing code. Worth checking out! (Says the author :))

Collections available in anRichment

The package either stores or provides access to multiple collections of reference gene sets:

  • GO, KEGG, NCBI BioSystems pathways: The starting point of most enrichment calculations. GO sets are accessed through Bioconductor annotation packages while KEGG and other components of the NCBI BioSystems pathway database are stored internally.
  • Internal collection: The original collection of gene sets collected by Jeremy A. Miller while he was a PhD student at UCLA. It contains brain and blood-related gene sets from various published articles.
  • HD Signatures Database (HDSigDB): A collection of gene sets directly or indirectly related to Huntington’s disease (HD). This collection is maintained by Rancho Biosciences under contract from CHDI, Inc. The HDinHD portal contains detailed descriptions of the gene sets. HDinHD requires free registration to access the data.
  • Miller AIBS collection: More gene sets collected by Jeremy A. Miller up to about 2014. Contains brain development-related gene sets, transcription factor targets and others.
  • HD Target DB: HD-related gene sets collected originally by Michael Palazzolo and Jim Wang for CHDI. Contains functional sets compiled from literature as well as textbooks, gene sets from HD perturbation studies, protein-protein interactor sets and others.
  • Neurogenomic sets collected by X. William Yang and members of his lab: Another collection of gene sets curated from published articles that people in the Yang lab found useful in their research.
  • Positional gene sets: Each gene set contains genes within a certain window around a given genomic position. These sets are generated dynamically from Bioconductor annotation packages.
  • Molecular Signatures Database: anRichment provides a function that converts the Molecular Signatures Database (MSigDB) in XML file format into an anRichment collection. Users wishing to use MSigDB need to obtain the XML file from the Broad Institute.

But wait, there’s more! Several further packages provide additional collections containing WGCNA modules from my own analyses of HD data, as well as brain-related gene sets culled from the literature by our friends at Verge Genomics.

The reader has by now surely noticed that most custom collections in anRichment focus on neuroscience in general and Huntington’s disease in particular. Well, that mirrors the focus of my own work and I suppose it will make anRichment, as it stands now, most useful to the neuroscience community. I surely hope that people in other fields who have their own favorite collection of literature gene sets find anRichment useful and perhaps share their collection or collections with the wider world.