Building A World-Class Genetics Center Based On Data Scalability

The ability to accelerate drug discovery has never been more critical. I recently spoke with Jeffrey Reid, Head of Genomics and Data Engineering for Regeneron. Reid works in the Regeneron Genetics Center (RGC), a research initiative that seeks to improve patient care by using genomic approaches to speed drug discovery and development. The genetics center is a unit of Regeneron (NASDAQ: REGN), a leading biotechnology company that has been at the forefront of drug discovery for three decades. The firm’s focus on translating science into medicine has led to seven FDA-approved treatments. The Regeneron Genetics Center is engaged in one of the largest genetic sequencing efforts in the world.

Reid describes his role as existing at the intersection of science and data, noting that he is responsible for “taking raw data and turning it into usable facts about genomes.” His role in data engineering entails the deployment of algorithms that enable drug development. As part of building a large genetic sequencing center, Reid works with more than 80 industry and academic research partners to combine genetics data with electronic health record (EHR) data to understand how genetics impacts health.

To enable these drug discovery efforts, Reid and his team have deployed the Databricks technology platform to help mine genomic data at scale. Reid remarks, “We bring to bear a lot of robotics in the lab and analysis automation.” He emphasizes the urgency of operating at scale, given the billions of combinations of genotypes and phenotypes that can be mined for drug development insights. “We need to identify every possible association between each genotype and phenotype. This requires us to analyze billions of cells of information,” says Reid.
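To make the combinatorics concrete, here is an illustrative sketch, not Regeneron’s actual pipeline, of what exhaustive genotype-phenotype association testing looks like. With millions of variants and thousands of EHR-derived phenotypes, the pairwise tests quickly number in the billions; the dimensions and random data below are invented purely to show the shape of the computation.

```python
# Illustrative only: exhaustively test every genotype column against
# every phenotype column. At real scale (~10M variants x ~10K traits),
# this loop becomes billions of statistical tests.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n_samples, n_variants, n_phenotypes = 1000, 5, 3  # tiny stand-ins

genotypes = rng.integers(0, 3, size=(n_samples, n_variants))    # 0/1/2 alt alleles
phenotypes = rng.integers(0, 2, size=(n_samples, n_phenotypes))  # case/control

for v in range(n_variants):
    for p in range(n_phenotypes):
        # Build a 3x2 contingency table: genotype state vs. case status.
        table = np.zeros((3, 2), dtype=int)
        for g, y in zip(genotypes[:, v], phenotypes[:, p]):
            table[g, y] += 1
        chi2, pval, _, _ = chi2_contingency(table)
        print(f"variant {v} x phenotype {p}: p = {pval:.3f}")
```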

Databricks provides Regeneron with a scalable solution for mining these vast amounts of data. Reid notes that in the past there was no scalable approach to managing volumes of data this large, and research companies were dependent upon home-built solutions based on antiquated approaches and technologies. According to Reid, Databricks delivers an enterprise platform that operates on the FAIR data principles of making data “findable, accessible, interoperable, and reusable” and helps drive scientific insights. Reid characterizes the technology environment at Regeneron as one of “tune up, deploy, tear down clusters” that support collaborative research initiatives such as Project Glow, an open-source toolkit for large-scale genomic analysis that was jointly created by the Regeneron Genetics Center and Databricks.
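For readers curious what Glow looks like in practice, the following is a minimal sketch based on Glow’s documented Spark integration: registering the library and reading variant calls from a VCF file straight into a distributed DataFrame. The file path and package version are placeholders, not details of Regeneron’s environment.

```python
# A minimal sketch of loading variant data with Glow on Apache Spark.
# The VCF path and package version are hypothetical placeholders.
import glow
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("glow-sketch")
         .config("spark.jars.packages", "io.projectglow:glow-spark3_2.12:1.2.1")
         .getOrCreate())
glow.register(spark)  # registers Glow's VCF reader and SQL functions

# Read variant calls directly into a distributed DataFrame.
variants = spark.read.format("vcf").load("/data/cohort.vcf.gz")

# Per-variant view: contig, position, and alt-allele counts per sample.
variants.selectExpr(
    "contigName",
    "start",
    "referenceAllele",
    "genotype_states(genotypes) AS num_alt_alleles",
).show(5)
```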

Databricks was launched in 2013 as what the firm’s CEO and cofounder, Ali Ghodsi, describes as a unified data analytics platform that combines data with AI/ML. Rather than managing data on one technology platform and analytics on another, Databricks has sought to create a single platform, leveraging the on-demand availability and economies of scale of Cloud computing, designed to enable “massive-scale data engineering and collaborative data science.”

Frank Nothaft, Technical Director of Healthcare and Life Sciences for Databricks, notes that healthcare and life sciences (HLS) firms have been among the earliest adopters of distributed and Cloud computing, driven by the need to support highly collaborative scientific computing and drug discovery activities. “Large pharmaceutical firms were at the forefront of the early movement of healthcare and life sciences into the Cloud,” says Nothaft. This movement was driven by work on genomic databases, phenotypes, genetic markers, and drug discovery, and by the need to reduce the high costs of creating new drugs. While mainstream industries like financial services have been among the heaviest users of and investors in data management capabilities, they have been slower to adopt Cloud computing.

Life sciences firms have also been leaders in the early adoption of AI/ML. Ghodsi and Nothaft cite the current healthcare challenge presented by the coronavirus as an example of how HLS firms can convert raw data into higher-quality data more rapidly. They point to the current testing shortage and how a unified AI/ML data platform can enable a real-time picture of available supply. They observe that the data problem presented by pandemics is one of being able to predict the course and trajectory of the disease. They note the impact on clinical trials, where data pipelines can be built in a matter of minutes via live feeds to EHR data.
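To illustrate what a pipeline “built in a matter of minutes” might look like, here is a hedged sketch using Spark Structured Streaming, the streaming engine that underpins Databricks pipelines. The schema, field names, and paths are invented for illustration; a real EHR feed would arrive in a standard such as HL7 FHIR and require dedicated parsing.

```python
# A hedged sketch of a streaming ingest pipeline with Spark Structured
# Streaming. All paths and fields below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ehr-stream-sketch").getOrCreate()

# Invented schema standing in for parsed EHR observation records.
schema = StructType([
    StructField("patient_id", StringType()),
    StructField("observation", StringType()),
    StructField("value", StringType()),
    StructField("observed_at", TimestampType()),
])

# Watch a landing directory for newly arriving EHR extracts.
events = (spark.readStream
          .schema(schema)
          .json("/landing/ehr_events/"))

# Continuously append records to a queryable table for analysis.
query = (events.writeStream
         .format("parquet")  # or "delta" on Databricks
         .option("path", "/tables/ehr_events/")
         .option("checkpointLocation", "/chk/ehr_events/")
         .start())
```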

The ability to use AI/ML with access to public datasets can also enable faster tracking, as illustrated by the Johns Hopkins coronavirus tracking initiative. Ghodsi forecasts, “AI/ML will enable massive societal change resulting from algorithmic automation of the most basic and mundane everyday tasks. There will be substantial cost savings realized from rapid pattern learning.” Ghodsi notes that though there is always interest in sexy AI applications like self-driving cars, so-called “boring AI” is where 95% of the benefit will be derived.

Regeneron’s Reid, however, expresses a contrarian view, concluding from his vantage point, “I am an AI pessimist. AI/ML is high-hanging fruit from where I sit. We are focused on harvesting the low-hanging fruit – culling large datasets at scale to derive a tremendous payback.” This is not even “boring AI,” but the result is highly impactful data management and analysis targeted at outcomes that will benefit mankind.
