dmrseq Powers Up Whole-Genome Bisulfite Sequencing Analysis

In our modern world, power is everything. Whether it be political, social, or even statistical, humankind always thirsts for more. While political and social power may be a little beyond our scope, a new bioinformatic package has come forth to power up your ability to detect differentially methylated regions (DMRs) from whole-genome bisulfite sequencing (WGBS) data.

The identification of DMRs by WGBS is no easy task; the high cost of sequencing leads to low sample sizes with low coverage and thus the requirement for complex statistical inference. Further complications arise from the correlated nature of CpG sites, which makes controlling the false discovery rate (FDR) challenging. However, as the price of sequencing continues to plummet, WGBS has emerged as a truly genome-wide method that can be applied to complex experimental designs.

To tackle the challenges of identifying DMRs from WGBS data, the lab of Rafael Irizarry at Harvard University (USA) has brought forth dmrseq. dmrseq builds on the data structure of the popular bsseq (BSmooth) package, which was also developed in the Irizarry lab, but offers a very different approach.

The identification of DMRs employs two critical steps:

1. DMR Detection: The differences in CpG methylation for the effect of interest are pooled and smoothed to give CpG sites with higher coverage a higher weight, and candidate DMRs are assembled.
2. Statistical Analysis: A region statistic for each DMR, which is comparable across the genome, is estimated via the application of a generalized least squares (GLS) regression model with a nested autoregressive correlated error structure for the effect of interest. Then, permutation testing of a pooled null distribution enables the identification of significant DMRs.
   - This approach accounts for both inter-individual and inter-CpG variability across the entire genome.
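
To make the detection step a little more concrete, here is a minimal Python sketch of the general idea: smooth the per-CpG differences with coverage as the weight, then group consecutive CpGs that pass a cutoff in the same direction. This is not dmrseq's code (dmrseq is an R package with its own smoother); the moving-average smoother, the function names, and the `window`, `cutoff`, and `min_cpgs` parameters are all assumptions made for illustration.

```python
def smooth_differences(diffs, coverage, window=2):
    """Coverage-weighted moving average of per-CpG methylation differences.

    A simple stand-in for a real smoother: CpGs with higher coverage
    pull the local average more strongly. `window` is hypothetical.
    """
    n = len(diffs)
    smoothed = []
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        total_weight = sum(coverage[lo:hi])
        weighted_sum = sum(d * w for d, w in zip(diffs[lo:hi], coverage[lo:hi]))
        smoothed.append(weighted_sum / total_weight if total_weight else 0.0)
    return smoothed


def candidate_regions(smoothed, cutoff=0.1, min_cpgs=3):
    """Return (start, end) index pairs for runs of CpGs beyond the cutoff.

    A run is a stretch of consecutive CpGs whose smoothed difference is
    above +cutoff, or below -cutoff, without changing direction.
    `cutoff` and `min_cpgs` are illustrative values only.
    """
    regions = []
    start = None
    sign = 0
    # A trailing 0.0 acts as a sentinel so the final run gets closed.
    for i, value in enumerate(list(smoothed) + [0.0]):
        current = (value > cutoff) - (value < -cutoff)
        if current != sign:
            if sign != 0 and i - start >= min_cpgs:
                regions.append((start, i - 1))
            start, sign = i, current
    return regions
```

With smoothed differences of `[0.0, 0.3, 0.4, 0.35, 0.0, -0.3, -0.4, -0.3, 0.0]`, this sketch reports two candidate regions, one hypermethylated and one hypomethylated. Each would then move on to the region-level statistical analysis described in the second step.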

Notably, by performing the statistical testing on DMRs and not CpGs, dmrseq offers accurate FDR control. This approach also allows the direct adjustment of covariates in the model, which is ideal for covariates that are continuous or contain two or more groups. Covariates can also be incorporated by balancing the permutations, which is ideal for two-group covariates such as sex. Finally, dmrseq also allows for multi-group comparisons and can identify DMRs with a sample size as low as two per group.
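
For readers who want to see what region-level testing against a pooled null can look like, here is a rough Python sketch. It assumes you already have one statistic per observed region plus a pool of statistics from regions found after permuting the sample labels. The function names are hypothetical, and the Benjamini-Hochberg adjustment is a common choice assumed here for illustration rather than something this article confirms about dmrseq.

```python
import bisect


def pooled_permutation_pvalues(observed, null_stats):
    """Empirical p-value for each observed region statistic.

    `null_stats` pools the statistics of every region from every
    permutation into one null distribution. Adding 1 to the numerator
    and denominator keeps p-values strictly above zero.
    """
    null_abs = sorted(abs(s) for s in null_stats)
    n = len(null_abs)
    pvals = []
    for stat in observed:
        # Count null statistics at least as extreme as the observed one.
        count = n - bisect.bisect_left(null_abs, abs(stat))
        pvals.append((count + 1) / (n + 1))
    return pvals


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvals[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Because each test unit here is a whole region rather than a single CpG, the adjusted values apply to the list of DMRs itself, which is the property the article highlights.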

By comparing dmrseq to bsseq, DSS, and Metilene, and examining the differences in DMR identification in data from the human epigenome roadmap, mouse models, or simulations, the talented team demonstrated the powerful capabilities of dmrseq in identifying DMRs.

Behind the WGBS Power

First author Keegan Korthauer shares the motivation behind the creation of dmrseq, “We noticed that existing tools for DMR identification from WGBS data were actually focused on discovering DMCs (differentially methylated CpGs), and then grouping them together to form DMRs. While these types of methods will provide a list of putative DMRs, they suffer from two main drawbacks: (1) There is no way to evaluate statistical significance of the list of DMRs. Even if we know about the statistical significance of each CpG, there is no formula that lets us compute the region-level significance from that information. (2) It is not clear how to rank the DMRs in terms of their signal. Most researchers have settled on using some sort of heuristic, such as the average methylation difference across all CpGs in the region, or the number of CpGs in the region, but these types of measures will be misleading when trying to compare areas of the genome that have different degrees of spatial correlation and variance of methylation levels. Thus, we set out to develop a method that would overcome these issues, and provide a list of DMRs that (1) has an accurate error rate, so that you can specify the proportion of false discoveries you are willing to consider, and (2) ranks regions according to the strength of signal, adjusted for local properties of spatial correlation and variance.”

Korthauer continues with how dmrseq will enable new insight from WGBS, “dmrseq provides a powerful new way to identify DMRs involved in complex traits or diseases. Because dmrseq provides a way to control the proportion of false discoveries and ranks DMRs by strength of signal, it provides a more compelling list of regions for characterization or follow-up study. As we demonstrate, we see a large enrichment of (the expected) association with expression of nearby genes for the most significant DMRs by dmrseq as compared to the DMRs with the highest average methylation difference. Our hope is that our tool will enable researchers to gain better insight into the role of DNA methylation in various biological processes.”

Finally, Korthauer concludes with the outlook that, “Moving forward, a challenge in the identification of DMRs from WGBS is to perform inference simultaneously at multiple scales of resolution. Currently, DMR discovery is highly influenced by the level of smoothing, which you can think of as how far you ‘zoom out’ when looking at patterns in noisy methylation measurements across the genome. You can adjust this according to your prior knowledge, or the type of regions you are interested in (e.g. small local regions versus large-scale blocks), but the ability to detect the scale of resolution automatically will be much more informative. In addition, as the technology to perform WGBS in single cells is rapidly maturing, an additional challenge (and opportunity!) will be to accommodate cell-to-cell differences in methylation levels in DMR identification.”

Get your hands on the dmrseq package over at Bioconductor and check out the full article in Biostatistics, February 2018.