Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.
Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.
These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that make it easy work with file formats commonly used for genome analysis like FastQ, BAM, and VCF.
In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo