Between the mapping centers generating enormous quantities of data for the NIH Roadmap Epigenomics Initiative and the NCBI, which archives and distributes it, lies the Epigenomics Data Analysis and Coordination Center (EDACC).
Aleksandar Milosavljevic and co-PI Arthur Beaudet won the 5-year, $7+ million U01 grant last year to set up and run the EDACC informatics clearinghouse at Houston’s Baylor College of Medicine, and so far so good. “Our program officer seems to be happy,” Milosavljevic says.
The Great Enabler: Tools For Handling Epigenomic Data
“If I were to summarize our role in one sentence, it is to enable data flows, quality control, primary data processing, and integrative analysis.”
Unpacking that a bit, Milosavljevic explains that once the raw data is received from the four Reference Epigenome Mapping Centers by EDACC, “the outcomes of individual assays are interpreted and reference epigenomes constructed.”
From there, “we generate additional biological insights from integrative analysis of different marks collected on the same genome, or cell line, or tissue — and also analysis of marks across different samples in tissues — to understand variations in epigenomes due to development, physiological conditions, aging, and other variables.
“We have an additional role, which is not strictly part of the current grant, to enable disease-oriented epigenome projects. (The Roadmap Initiative has several components, one of which is disease-focused research.) These projects require the reference epigenomes as surrogate controls.”
Blazing New Trails In Epigenome Informatics
All the data generation “requires methodological development, trail blazing if you wish, which is done by the reference epigenome mapping centers.
“But those require additional trail blazing on the informatics side: methods development; definition of data processing steps and data types; understanding how the quality of the data should be controlled; what can be expected in terms of reproducibility. How do we compare epigenomes? How do we find significant local differences, [or] variations from a reference?”
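To make the comparison question concrete, here is one naive way it could be framed: bin a mark’s read counts along the genome for two samples and flag bins where the counts diverge. The binning scheme and the exact binomial test below are illustrative assumptions, not the EDACC’s published comparison method.

```python
# Toy sketch: flag genome bins where two samples' mark counts differ.
# The statistical framing here is an illustrative assumption.
from math import comb

def binom_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial test (small n only)."""
    pk = comb(n, k) * p**k * (1 - p)**(n - k)
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(n + 1)
               if comb(n, i) * p**i * (1 - p)**(n - i) <= pk)

def differing_bins(marks_a, marks_b, alpha=0.05):
    """Return indices of bins where counts for the same mark diverge.

    marks_a, marks_b: per-bin read counts in two samples, assumed
    sequenced to comparable depth for simplicity.
    """
    hits = []
    for i, (a, b) in enumerate(zip(marks_a, marks_b)):
        n = a + b
        if n and binom_two_sided_p(min(a, b), n) < alpha:
            hits.append(i)
    return hits

print(differing_bins([5, 20, 7, 0], [6, 2, 8, 1]))  # -> [1]
```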
Mapping centers generate data and associated metadata and upload them to the EDACC. And “we process the data” at various levels, from raw data through “an integrative analysis of data from multiple marks or multiple samples,” a tiered scheme akin to the data levels used in the Cancer Genome Atlas pilot project, which Milosavljevic helped to create. “This is evolving: I think we have level 2 reasonably well defined.”
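A minimal sketch of what such TCGA-style tiered levels could look like in code follows; the level boundaries, field names, and file formats are illustrative assumptions, not the EDACC’s actual schema.

```python
# Illustrative sketch of tiered data levels, TCGA-style.
# Level names and artifact formats below are assumptions.
from dataclasses import dataclass, field

@dataclass
class EpigenomeDataset:
    sample_id: str
    assay: str                      # e.g. "H3K4me3 ChIP-seq"
    level: int = 0                  # 0 = raw reads, as uploaded
    artifacts: dict = field(default_factory=dict)

def promote_to_level1(ds: EpigenomeDataset) -> EpigenomeDataset:
    """Level 1: reads aligned to the reference genome (placeholder)."""
    ds.artifacts["alignments"] = f"{ds.sample_id}.bam"
    ds.level = 1
    return ds

def promote_to_level2(ds: EpigenomeDataset) -> EpigenomeDataset:
    """Level 2: per-assay interpretation, e.g. called peaks/marks."""
    assert ds.level >= 1, "must align reads before calling marks"
    ds.artifacts["peaks"] = f"{ds.sample_id}.bed"
    ds.level = 2
    return ds

def integrate(datasets: list[EpigenomeDataset]) -> dict:
    """Higher level: integrative analysis across marks or samples."""
    assert all(d.level >= 2 for d in datasets)
    return {"samples": [d.sample_id for d in datasets],
            "marks": sorted({d.assay for d in datasets})}

if __name__ == "__main__":
    raw = EpigenomeDataset("ES_cell_line_01", "H3K4me3 ChIP-seq")
    print(integrate([promote_to_level2(promote_to_level1(raw))]))
```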
“It also requires informatics infrastructure, which is built through an engineering effort.”
The EDACC is building upon Baylor’s now well-established Genboree system. “Our idea was that we should be able to enable genome-centric research without expecting researchers to own hardware or install software. We hoped that everything they need could be provided by remotely hosted software.
“This is also where we plan to contribute: using a software-as-a-service, cloud computing, or Web 2.0 model to provide epigenomic software as a service to small projects, which may be led by a practicing physician who sees patients in the morning and does epigenomics research in the afternoon, using samples collected from those patients.
“We’d like to provide Genboree services and consulting services to them, to the degree they are required, and enable them to quickly adopt these methodologies, use reference epigenomes, analyze their data, and come up with results much faster than they could without working with us.”
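From the physician-researcher’s side, a hosted software-as-a-service workflow might reduce to a few HTTP calls: upload a data track, request an analysis, poll for the result. The host URL and endpoints in this sketch are hypothetical placeholders, not Genboree’s actual API.

```python
# Hypothetical client sketch of "epigenomic software as a service".
# HOST and all endpoint paths are placeholders, not a real API.
import time
import requests

HOST = "https://genboree.example.org/api"   # placeholder URL

def upload_track(session: requests.Session, path: str) -> str:
    """Stream a local data track to the hosted service."""
    with open(path, "rb") as fh:
        resp = session.post(f"{HOST}/tracks", data=fh)
    resp.raise_for_status()
    return resp.json()["track_id"]

def run_analysis(session: requests.Session, track_id: str) -> dict:
    """Kick off a hosted analysis job and poll until it finishes."""
    resp = session.post(f"{HOST}/analyses",
                        json={"track": track_id, "tool": "peak-compare"})
    resp.raise_for_status()
    job = resp.json()["job_id"]
    while True:                         # poll the remote job status
        status = session.get(f"{HOST}/analyses/{job}").json()
        if status["state"] in ("done", "failed"):
            return status
        time.sleep(30)
```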
Who is Dr. Milosavljevic?
“I’m a computer scientist by training.” Coming to Baylor after successful stints in the commercial sector provided Milosavljevic with “an opportunity to bridge high-throughput data generation with biomedical discovery. Baylor was very generous in supporting my startup. And I brought in a number of engineers from key companies who worked in the bioinformatics department at Genometrics and elsewhere. I had a critical mass of engineering talent to take a fresher look at genome informatics.”
The EDACC’s co-director “is Art Beaudet, my department chair here at Baylor, who is an established epigeneticist. I am director of the center, and I bring the genomics and informatics components to it.”
Beyond the Stone Age of Epigenetic Data Analysis
Data analysis was perhaps the principal bottleneck of the Human Genome Project. But “technology has advanced since then. 2001 looks like the Stone Age from the point of view of the Web. In the past two years, major dot-coms (Amazon, Google) have come up with the next generation of technologies that allow third parties to develop code that works with their web services. So we are following all of these technology developments very closely, and translating them into our software systems.
“I can’t imagine how we could have done this project in 2001, not only from the sequencing point of view (everyone is focusing on the surplus of sequencing machines) but also from the point of view of the maturity of the Web and related technologies.
“Without XML software to support this processing, without web-based application programming interfaces, without high-bandwidth Internet connections, we couldn’t have done this project.”
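As a small illustration of that XML-plus-web-API plumbing, here is how a client might parse a sample-metadata submission of the kind a mapping center could send. The document schema is made up for illustration; it is not a real EDACC or NCBI format.

```python
# Parse a (made-up) sample-metadata submission document.
# Element and attribute names are illustrative assumptions.
import xml.etree.ElementTree as ET

SUBMISSION = """\
<submission center="REMC-1">
  <sample id="ES_cell_line_01">
    <assay type="H3K4me3 ChIP-seq" platform="Illumina GA II"/>
  </sample>
</submission>"""

root = ET.fromstring(SUBMISSION)
for sample in root.iter("sample"):
    for assay in sample.iter("assay"):
        print(sample.get("id"), assay.get("type"), assay.get("platform"))
```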
“I believe we have solved (I dare say) all technical throughput problems, and at this point getting the group to adopt new procedures and new ways of collaborating is becoming the bottleneck. Now that the technical aspects are eliminated, the social aspects of doing science in a new way are the bottleneck.
“We had a very challenging year, but my own personal expectations were exceeded. … We are ready to accept even higher volumes of data than are currently being produced.”