SiftSeq: Classifying short DNA sequences with deep learning

Elan Stopnitzky
Published in Insight
5 min read · May 14, 2020


Next-generation sequencing makes it amazingly cheap to obtain short DNA sequences, but analyzing these sequences with traditional tools remains slow and difficult. In this post, I demonstrate how deep learning can significantly improve upon earlier methods, with an emphasis on classifying short sequences as human, viral, or bacterial. This technology could speed up the identification of novel pathogens such as SARS-CoV-2 in the future and help save lives.

Source code and Docker container for the software package can be found here.

How can we quickly understand emerging infectious diseases? The ability to sort out novel viral and bacterial DNA from human DNA in patient samples may aid this process. Image source: https://coronavirus.jhu.edu/map.html

Motivation

Around December 2019, patients with severe respiratory infections began arriving at hospitals in Wuhan, kicking off a race to identify the cause of the illness. However, it would be about a month before the new pathogen was fully identified. This lag was due to the relative difficulty of isolating and culturing viral particles, sequencing their genetic material in sufficient quantities, and assembling sequence fragments into a new genome. I began to wonder if it would be cheaper and easier to instead sequence all the DNA in a patient sample, and then use deep learning to separate out the viral sequences for further analysis. As I discovered, deep learning is a powerful tool for short sequence classification and is likely to be useful in many other applications as well.

Background

Next generation sequencing has made it possible to obtain short DNA sequences at a tiny fraction of the cost of earlier techniques. However, analyzing short sequences remains a challenge. Current approaches can be categorized roughly as:

Alignment: Reads (i.e. sequenced segments) are matched to previously identified sequences in a database. This is difficult if the reads diverge significantly from known sequences, and time-consuming if the reads must be checked against many reference genomes. Alignment-free methods, such as deep learning, allow for the identification of divergent sequences and can complement alignment by first narrowing down the search space.

Assembly: If enough DNA can be collected from a pure source, overlapping reads can be assembled into a novel genome. For microorganisms, this often requires first isolating and culturing them in a lab, which may be challenging. For sources potentially containing DNA from diverse organisms, such as samples of seawater or human gut flora, this process could again be aided if one were first able to quickly and easily sort the sequences by their probable origin.

Machine Learning: ML algorithms that use frequency counts of n-grams of bases as an embedding have shown good accuracy in sorting metagenomic reads, but do not work well for sequences shorter than about 500 bases because too few n-grams are observed to give reliable counts (for example, see the TETRA algorithm). These methods have also recently been used to identify whole-genome relationships in the context of SARS-CoV-2.

The SiftSeq deep learning model I have developed can complement traditional bioinformatics methods to reduce time and cost by classifying short, novel sequences in an alignment-free manner. Deep learning, and in particular the use of long short-term memory (LSTM) layers, makes it possible to capture information about potentially complex spatial patterns of bases, and therefore sidesteps the poor count statistics that limit frequency-based embeddings of short sequences.

Possible applications of this work include:

  • Rapid pathogen identification for emerging diseases
  • Characterization of microbial communities in the gut and their implications for health
  • Surveys of microbial communities and environmental DNA for ecology and conservation
  • Identification of bacteriophages in the environment for phage-based therapies to combat antibiotic resistance

Comparison with prior work

Two recent models that use deep learning for this task are ViraMiner and VirNet. ViraMiner employs a CNN for distinguishing viral and human DNA, and VirNet employs an LSTM for distinguishing viral and bacterial DNA. SiftSeq combines both of these architectures synergistically to significantly improve accuracy while handling shorter sequences and all three classes: human, viral, and bacterial.

Data Pipeline

Input to the model is a FASTA file containing sequences of 100 base pairs. Sequences containing missing bases are removed and bases are transformed into a one-hot encoding. My training set consisted of DNA sampled from the complete genomes of 14 pathogenic viruses and bacteria as well as a portion of the human genome, and my test set consisted of a previously unseen part of the human genome together with DNA from 6 previously unseen viral and bacterial species. For each sequence, the model outputs the probability of coming from each of the three sources, and performance is evaluated using accuracy and area under the curve (AUC).
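To make the preprocessing step concrete, here is a minimal sketch in plain Python. It is not the SiftSeq code itself: the function names, the A/C/G/T column order, and the choice to drop reads of the wrong length are all assumptions made for illustration, and the demo uses 6-base reads only to keep the example small.

```python
from io import StringIO

BASES = "ACGT"  # assumed column order for the one-hot vectors
ONE_HOT = {b: [1 if i == j else 0 for j in range(4)] for i, b in enumerate(BASES)}


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def encode_reads(handle, read_length=100):
    """Drop reads with missing bases (anything other than A, C, G, T)
    or an unexpected length, then one-hot encode what remains."""
    kept = []
    for header, seq in read_fasta(handle):
        if len(seq) != read_length:
            continue  # assumption: the model expects fixed-length input
        if any(base not in ONE_HOT for base in seq):
            continue  # e.g. 'N' marks a missing base
        kept.append((header, [ONE_HOT[base] for base in seq]))
    return kept


# Tiny demo: read2 contains an 'N' and is removed; read3 spans two lines.
demo = StringIO(">read1\nACGTAC\n>read2\nACNTAC\n>read3\nGGCC\nTA\n")
encoded = encode_reads(demo, read_length=6)
print([header for header, _ in encoded])
```

Each surviving read becomes a list of 4-element vectors, one per base, which is the shape a convolutional layer over the sequence axis would consume.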

Model Architecture

Due to the genetic diversity of microorganisms and the speed with which they evolve, dropout layers are essential to prevent over-fitting. Frequency counts of n-grams of bases in sequences of DNA are well known to provide an accurate species signature. The longer the n-grams are, the more information they contain about the identity of the organism, but because the space of n-grams grows as 4^n (for the four bases), longer n-grams also make it harder to get good statistics on the counts.
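The trade-off is easy to see with a little counting. The sketch below is only an illustration of a frequency-count embedding (the function name `kmer_frequencies` is hypothetical, not part of SiftSeq): it counts overlapping n-grams and reports their relative frequencies. A 100-base read contains just 97 overlapping tetramers, while there are 4^4 = 256 possible tetramers, so most entries of the embedding are necessarily zero.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return the relative frequency of every possible k-mer over A, C, G, T,
    counted over all overlapping windows of length k in seq."""
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other characters (e.g. 'N') are ignored.
    total = sum(counts[kmer] for kmer in all_kmers)
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Average number of observations per possible tetramer at two read lengths.
for read_length in (100, 500):
    windows = read_length - 4 + 1
    print(read_length, windows, windows / 4 ** 4)
```

At 100 bases there is less than one observation per possible tetramer on average, which is one way to picture why count-based methods struggle on short reads.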

Earlier researchers have found that sequences of length four, also known as tetramers, strike the ideal balance between these competing effects. Therefore, I introduced a 1D convolutional layer to recognize these n-grams regardless of their position in the sequence and chose the size of the receptive field to be four. Methods based on frequency counts alone do not, however, work well for sequences shorter than about 500 bases. The key innovation of my model is a downstream LSTM layer that further analyzes the patterns in which tetramers are arranged in the sequence. This lets the model extract extra information and beat prior benchmarks while handling more classes and sequences as short as 100 bases.
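To give some intuition for why a convolution with a receptive field of four acts as a position-independent tetramer detector, here is a toy sketch in plain Python. It is not the SiftSeq layer: in a trained network the filter weights are learned, whereas here a single filter is set by hand to the one-hot pattern of "GATC", and the helper names and the bias value are assumptions chosen for illustration. The downstream LSTM is not shown.

```python
BASES = "ACGT"


def one_hot(seq):
    """One-hot encode a DNA string as a list of 4-element vectors."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq]


def tetramer_filter(tetramer):
    """Hand-built width-4 filter: its weights are the tetramer's one-hot pattern."""
    return one_hot(tetramer)


def conv1d(encoded, kernel):
    """Stride-1 convolution without padding along the sequence axis
    (cross-correlation, as deep learning libraries compute it)."""
    width = len(kernel)
    out = []
    for start in range(len(encoded) - width + 1):
        score = sum(
            encoded[start + i][c] * kernel[i][c]
            for i in range(width)
            for c in range(4)
        )
        out.append(score)
    return out


def relu_with_bias(activations, bias):
    """Add a bias and apply ReLU, so only strong matches stay above zero."""
    return [max(0.0, a + bias) for a in activations]


seq = "TTGATCAAGATCGG"
raw = conv1d(one_hot(seq), tetramer_filter("GATC"))
# A perfect match scores 4; a bias of -3 silences partial matches.
hits = relu_with_bias(raw, bias=-3.0)
positions = [i for i, h in enumerate(hits) if h > 0]
print(positions)
```

The same filter fires wherever its tetramer occurs, and the resulting sequence of activations preserves where each match happened. That ordered output is the kind of signal a downstream recurrent layer can read, in contrast to a frequency count, which discards position.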

Conclusion

Deep neural networks, and in particular the SiftSeq CNN+LSTM architecture I have described here, are a powerful tool for the analysis of short metagenomic reads and can be used in conjunction with traditional bioinformatics tools to make analysis easier. This technology shows great promise and, with modest adjustments, could see a wide range of important applications in medicine, from speeding up the identification of novel pathogens like SARS-CoV-2 to gaining a more complete picture of gut microbial communities. The SiftSeq software package and associated Docker container can be downloaded for free here and used out-of-the-box, or retrained with user-supplied data.



Interdisciplinary scientist and ML engineer with special interests in complex systems, information theory, and biotech.