Big Data Storage and Graph-Based Analytics for Cancer Research

Big data technologies are playing an increasing role in cancer research. Two areas I’ll touch on here are the storage of an exponentially growing volume of sequenced genomic data and the analysis of pathway disruptions using large-scale graph algorithms.

There is an amazing diversity among cancer cells. A single tumor may contain 100 billion cells, each carrying its own somatic mutations. A single patient can harbor so many different mutations that one tumor sample will not give a comprehensive picture of the mutations present in that individual, hence the push to sequence as many genomes as possible.

Since the completion of the Human Genome Project in 2003, sequencing throughput has grown to more than 10,000 gigabase pairs per week, producing a huge increase in the amount of archived genomic data. Massively parallel sequencing technologies now allow us to sequence far more quickly and cheaply, and the price of sequencing a genome has dropped from roughly $100 million in 2003 to about $1,000 in 2016. Unlike much other scientific big data, such as that from the Large Hadron Collider, genomic data is typically stored in full for future reference.

This drop in cost, together with the highly varied nature of cancer cells, is rapidly propelling cancer research into the realm of big data. There are already more than 1,000 genomic sequencing centers in the world, and projections suggest that, within a few years, the storage consumed by sequenced human DNA will be growing at an annual rate of 2 to 40 exabytes.
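To see where a range of that magnitude could come from, here is a rough back-of-envelope calculation in Python. The per-genome sizes and genomes-per-year projections are illustrative assumptions of my own, not figures taken from the projections above:

```python
# Back-of-envelope estimate of annual genomic storage growth.
# Assumptions (illustrative): roughly 100-200 GB of raw sequence data per
# whole human genome, and tens to hundreds of millions of genomes
# sequenced per year within a few years.

BYTES_PER_GENOME_LOW = 100e9    # ~100 GB per genome (compressed reads)
BYTES_PER_GENOME_HIGH = 200e9   # ~200 GB per genome (raw read files)

GENOMES_PER_YEAR_LOW = 20e6     # conservative projection
GENOMES_PER_YEAR_HIGH = 200e6   # aggressive projection

EXABYTE = 1e18

low = GENOMES_PER_YEAR_LOW * BYTES_PER_GENOME_LOW / EXABYTE
high = GENOMES_PER_YEAR_HIGH * BYTES_PER_GENOME_HIGH / EXABYTE

print(f"Estimated annual storage growth: {low:.0f} to {high:.0f} exabytes per year")
# -> roughly 2 to 40 exabytes per year, the range quoted above
```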

Big Data Analytics of Pathway Disruptions

Cancer is characterized by the disruption of signaling pathways (networks of interacting proteins) within a cell. A disrupted pathway may be the signature of a cancer, but the same pathway may be affected by different somatic mutations in different patients, since the network can be broken at different points. The result is a vast quantity of data from which we can attempt to solve a very complex and very important problem.

How is this data analyzed?

When studying a cohort of patients, each with a set of somatic mutations, we don’t initially know which of those mutations are driving the cancer (the “driver mutations”). We sequence the genomes, record which genes are mutated, and look for mutations shared by an unusually large number of patients, since those are the likely drivers. Traditionally, researchers analyzed genes individually to see which were mutated more often than expected (no single gene is mutated in every patient), an analysis that requires careful multiple-testing corrections across roughly 25,000 genes.
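As a rough illustration of this single-gene approach, the following Python sketch counts how often each gene is mutated across a toy cohort and applies a Bonferroni correction for testing 25,000 genes. The cohort, background mutation rate, and thresholds are made-up placeholders; real analyses use far more sophisticated background mutation models:

```python
# Minimal sketch of the traditional single-gene analysis: count how many
# patients have a somatic mutation in each gene, then flag genes mutated
# more often than a background rate would predict, correcting for the
# fact that ~25,000 genes are tested at once (Bonferroni correction).

from math import comb

# Hypothetical cohort: patient -> set of mutated genes
cohort = {
    "patient_1": {"TP53", "PTEN", "GENE_X"},
    "patient_2": {"TP53", "GENE_Y"},
    "patient_3": {"TP53", "PTEN"},
    "patient_4": {"GENE_X", "GENE_Z"},
}

N_GENES_TESTED = 25_000     # roughly the number of human genes
BACKGROUND_RATE = 0.05      # assumed chance a gene is mutated by chance in a patient
ALPHA = 0.05                # desired family-wise error rate

def binomial_p_value(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more mutated patients by chance."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_patients = len(cohort)
counts = {}
for genes in cohort.values():
    for gene in genes:
        counts[gene] = counts.get(gene, 0) + 1

# With such a tiny toy cohort nothing survives the correction, which
# illustrates how demanding the per-gene test becomes at genome scale.
bonferroni_alpha = ALPHA / N_GENES_TESTED
for gene, k in sorted(counts.items(), key=lambda kv: -kv[1]):
    p = binomial_p_value(k, n_patients, BACKGROUND_RATE)
    flag = "candidate driver" if p < bonferroni_alpha else "not significant"
    print(f"{gene}: mutated in {k}/{n_patients} patients, p={p:.2e} ({flag})")
```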

Some researchers are now focusing instead on entire protein pathways. Here again, we face the challenge that pathway mutations differ between patients. For example, a 2008 TCGA study published in Nature found a specific pathway mutated in 87% of the glioblastoma patients sampled: significant, but not consistent.

This is where big data analytics comes in. We start by constructing a gene interaction network: a graph with roughly 25,000 vertices (one per gene) and up to 625 million edges (possible pairwise interactions). Pathways are represented as subnetworks in this graph. We then use an iterative big data graph library such as Gelly (part of Apache Flink) to find connected subnetworks that are mutated in a significant number of patients.
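A minimal, single-machine sketch of this idea is shown below, using networkx as a stand-in for a distributed library like Gelly. The interaction edges, per-gene mutation counts, and threshold are purely illustrative assumptions:

```python
# Simplified sketch of the subnetwork search described above: keep genes
# mutated in enough patients, induce the subgraph on those genes, and
# report each connected component as a candidate mutated pathway.

import networkx as nx

# Hypothetical gene interaction network (vertices = genes, edges = interactions)
interactions = [
    ("TP53", "MDM2"), ("MDM2", "CDKN2A"), ("PTEN", "PIK3CA"),
    ("PIK3CA", "AKT1"), ("AKT1", "MTOR"), ("GENE_X", "GENE_Y"),
]

# Hypothetical counts: number of patients in the cohort with each gene mutated
mutated_patients = {
    "TP53": 42, "MDM2": 11, "CDKN2A": 19, "PTEN": 25,
    "PIK3CA": 30, "AKT1": 8, "MTOR": 3, "GENE_X": 1, "GENE_Y": 2,
}

MIN_PATIENTS = 5   # keep genes mutated in at least this many patients

graph = nx.Graph(interactions)

hot_genes = [g for g in graph if mutated_patients.get(g, 0) >= MIN_PATIENTS]
hot_subgraph = graph.subgraph(hot_genes)

for component in nx.connected_components(hot_subgraph):
    total = sum(mutated_patients[g] for g in component)
    print(f"candidate subnetwork: {sorted(component)} (mutation count: {total})")
```

In a production setting, the same logic would run as an iterative, vertex-centric program over the full gene network on a cluster, combined with statistical tests of whether each subnetwork is mutated in more patients than chance would predict.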

This method has produced interesting results. When applied to data from a cohort of 316 ovarian cancer patients, researchers found 27 subnetworks (of at least 7 genes each) implicated in the disease, many of which interact with one another, so a more complex picture began to emerge. Twelve of these 27 were known, previously studied pathways; the other 15 were new to researchers.

This analysis was also applied to a study of 199 patients with acute myeloid leukemia, where it uncovered five subnetworks of at least five genes, each of which was significantly mutated. When applied to a sample of 514 patients with breast cancer, it found thirteen subnetworks of at least eight genes each. (These results were published in Nature in 2008 and 2011.)

A Combined Effort to Treat Cancer

The ultimate goal is personalized cancer treatment based on a deep understanding of the mutations in each patient’s cancer genome. This brings us to the data challenge of consolidating genomic archives, research literature, trial results, and individual health records, which requires aggregating across disparate structured and unstructured data sets.

A growing community is forming around the effort to assemble a shared genomic database for use in combating cancer. In 2016, the National Cancer Institute launched the Genomic Data Commons (GDC), describing it as ‘a first-of-its-kind public data platform for storing, analyzing, and sharing genomic and associated clinical data on cancer.’ Since its launch, several research centers have shared their cancer genomic datasets with the GDC.

As we see the convergence of computing and medical technologies in these and other cancer research methodologies, we look forward to increased benefits from big data analytics in the fight against cancer.

On a side note, a key aspect of healthcare AI is having a product owner in place who can steer these data science projects effectively.