Vast amounts of DNA methylation data have paved the way for epigenetic clocks to significantly advance aging research; however, generative artificial intelligence (AI) may have now laid the foundation for more transformative insights into DNA methylation. Two preprints report the application of transformer architecture to develop “foundation” or large AI models trained on vast DNA methylation datasets that the authors hope will support more complex levels of analysis.
Let’s welcome CpGPT and MethylGPT: the new models that lay the foundation for a new era of DNA methylation studies.
CpGPT – “A new benchmark for DNA methylation analysis”
Our first article, from the lab of Bo Wang (University of Toronto), describes the “Cytosine-phosphate-Guanine Pretrained Transformer,” or CpGPT, a transformer-based deep neural network. The model uses the attention mechanism, which provides sample-specific importance scores for CpG sites, to learn relationships between DNA methylation sites by incorporating sequence, positional, and epigenetic information. The authors pre-trained CpGPT on over 1,500 DNA methylation datasets (more than 100,000 samples) from a broad range of tissues and conditions, collected as the comprehensive CpGCorpus dataset.
Let’s hear more from de Lima Camillo and colleagues on CpGPT as a new benchmark for DNA methylation analysis:
- CpGPT leverages an improved transformer architecture to learn comprehensive representations of DNA methylation patterns to impute and reconstruct genome-wide DNA methylation profiles from limited data
- CpGPT performs robustly across multiple cohorts, with high accuracy and consistency across a variety of datasets and metrics
- Analyzing sample-specific attention weights enables the identification of influential CpGs for each prediction
- Capturing sequence, positional, and epigenetic contexts allows CpGPT to outperform related models when fine-tuned, and it also performs well without fine-tuning; in fact, CpGPT:
  - Effectively differentiates between high- and low-survival individuals, highlighting the ability to capture biologically meaningful variations in aging/mortality
  - Exhibits robust predictive capabilities for morbidity outcomes, incorporating multiple diseases and functional measures across cohorts
  - Demonstrates associations with metabolic/lifestyle-related health assessments, cancer status, and depression measures, further highlighting its broad applicability
- CpGPT identifies CpG islands and chromatin states without supervision, indicating that it internalizes biologically relevant patterns from DNA methylation data and highlighting the power of unsupervised deep learning
- CpGPT excels when fine-tuned for specific tasks, exemplified by its robust performance in the “Biomarkers of Aging Challenge,” which sets a new benchmark for DNA methylation analysis
Overall, the highly generalizable CpGPT framework combines deep learning and comprehensive DNA methylation data to establish a new benchmark in epigenetic analysis and offer a versatile tool for multiple applications in aging and beyond.
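The idea of sample-specific attention scores for CpG sites can be illustrated with a toy sketch. This is not CpGPT's actual architecture or code: the site names, the two-dimensional embeddings, and the `site_importance` helper are all made up for illustration, and the sketch uses plain scaled dot-product self-attention with no learned weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(embeddings):
    """Scaled dot-product self-attention weights (queries = keys = embeddings).

    Row i of the result holds how strongly site i attends to every site.
    """
    dim = len(embeddings[0])
    weights = []
    for query in embeddings:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in embeddings
        ]
        weights.append(softmax(scores))
    return weights

def site_importance(weights):
    """Average attention each site receives across all query sites."""
    n = len(weights)
    return [sum(row[j] for row in weights) / n for j in range(n)]

# Hypothetical per-sample embeddings for three CpG sites (invented values).
sample = {
    "cg_site_a": [0.9, 0.1],
    "cg_site_b": [0.8, 0.2],
    "cg_site_c": [-0.5, 0.7],
}
weights = attention_weights(list(sample.values()))
importance = dict(zip(sample, site_importance(weights)))
ranked = sorted(importance, key=importance.get, reverse=True)
```

Because the weights depend on each sample's own embeddings, a different sample would yield a different ranking of sites, which is the sense in which such scores are sample-specific.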
MethylGPT – “A transformer-based foundation model for DNA methylation”
Our second article, from the lab of Vadim N. Gladyshev (Brigham and Women’s Hospital/Harvard Medical School), reports on MethylGPT, a transformer-based foundation model that employs a novel embedding strategy to capture DNA methylation patterns at physiologically relevant CpG sites. The authors trained their model on over 150,000 human methylation profiles from 5,281 datasets, spanning diverse tissue types and covering 49,156 CpG sites.
Let’s hear more from Ying and colleagues on MethylGPT, their transformer-based foundation model for DNA methylation:
- MethylGPT effectively models DNA methylation patterns and reveals fundamental aspects of regulation without external supervision
- MethylGPT addresses a fundamental limitation of traditional linear models that treat CpG sites as independent entities
- MethylGPT’s performance in age prediction across diverse tissue types (with significantly improved accuracy over existing methods) demonstrates its potential utility
- MethylGPT is notably resilient to missing data, maintaining stable performance with up to 70% of the data missing owing to the model’s ability to leverage redundant biological signals across multiple CpG sites
- When fine-tuned for mortality/disease prediction and evaluated on 18,859 samples from the Generation Scotland cohort, MethylGPT makes robust predictions across a range of significant conditions and enables the systematic evaluation of therapeutic effects on disease risks
- These results demonstrate the considerable potential of MethylGPT in clinical applications
- Analysis of MethylGPT’s attention patterns reveals distinct methylation signatures between samples from young and old individuals
- The enrichment of development-related processes in younger samples and aging-associated pathways in older samples suggests the capture of biologically meaningful age-dependent changes in methylation regulation, which may offer novel insight into how DNA methylation patterns evolve
MethylGPT reveals that transformer architectures can model DNA methylation patterns while preserving biological information and integrate the analysis of multiple forms of DNA methylation data; furthermore, the robust performance when managing missing data suggests utility in research and clinical applications.
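The intuition behind resilience to missing data (redundant signal spread across many CpG sites) can be shown with a small simulation. This is only a toy sketch, not MethylGPT's method or evaluation: the synthetic profile, the noise level, and the `mask_sites` and `estimate_signal` helpers are assumptions made for illustration, and the "model" is just an average over observed sites.

```python
import random

def mask_sites(profile, missing_fraction, rng):
    """Return a copy of the profile with a fraction of sites set to None."""
    n_missing = round(len(profile) * missing_fraction)
    hidden = set(rng.sample(range(len(profile)), n_missing))
    return [None if i in hidden else v for i, v in enumerate(profile)]

def estimate_signal(profile):
    """Average the observed sites, ignoring missing (None) entries."""
    observed = [v for v in profile if v is not None]
    return sum(observed) / len(observed)

rng = random.Random(0)  # fixed seed so the toy example is repeatable
true_signal = 0.6       # hypothetical shared methylation level

# 1,000 synthetic sites that each carry the same signal plus independent noise.
profile = [true_signal + rng.gauss(0.0, 0.05) for _ in range(1000)]

# Hide 70% of the sites, mirroring the missing-data level quoted above.
masked = mask_sites(profile, missing_fraction=0.7, rng=rng)

full_estimate = estimate_signal(profile)
masked_estimate = estimate_signal(masked)
n_observed = sum(v is not None for v in masked)
```

Comparing `full_estimate` with `masked_estimate` shows how little the estimate moves when the remaining sites still carry the shared signal; a real foundation model would have to learn such redundancy from data rather than have it built in.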
What Do We Build Upon this Foundation?
Implementing transformer architecture to develop foundation models trained on vast DNA methylation datasets provides the basis to build on our knowledge base regarding the interconnected topics of epigenetics, aging, health, and disease. The pertinent question is now – what will your lab build upon this foundation?
For more on the foundational nature of these new AI advances, check out the original bioRxiv articles describing CpGPT and MethylGPT.