Genome Analysis Pipelines with Spark and ADAM

Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.

Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.

These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that make it easy to work with file formats commonly used for genome analysis, such as FastQ, BAM, and VCF.
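
To make the FastQ format mentioned above concrete, here is a minimal plain-Python sketch of reading it. It is not ADAM's API. It assumes the common layout of four lines per record (an `@name` header, the sequence, a `+` separator, and a quality string), and the names `FastqRecord`, `parse_fastq`, and `EXAMPLE` are made up for this illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: its name, its bases, and a per-base quality string."""
    name: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two tiny made-up reads, purely for illustration.
EXAMPLE = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n"
records = list(parse_fastq(io.StringIO(EXAMPLE)))
for record in records:
    print(record.name, record.sequence, record.quality)
```

A library like ADAM handles this parsing (and the BAM and VCF equivalents) at cluster scale; the sketch only shows what one record looks like.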

In this presentation, we’ll explore how sequence alignment, a step common to many bioinformatics workflows, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo
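
As a rough illustration of what "aligning short reads to a reference genome" means, the sketch below looks up each read in a toy reference by exact substring match and labels the result with the three outcomes listed on the "Possible Align() Outcomes" slide. This is not how Bowtie works: Bowtie searches a prebuilt index of the reference and tolerates mismatches, and a real aligner also considers the reverse strand. The reference, the reads, and the function names are all invented for this example.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int]  # (reference name, 0-based start position)


def find_all(reference: str, read: str) -> List[int]:
    """Return every start position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


def align(reads: Dict[str, str], references: Dict[str, str]) -> Dict[str, List[Hit]]:
    """Naive exact-match alignment: map each read name to its list of hits."""
    return {
        read_name: [
            (ref_name, position)
            for ref_name, ref_sequence in references.items()
            for position in find_all(ref_sequence, read_sequence)
        ]
        for read_name, read_sequence in reads.items()
    }


def classify(hits: List[Hit]) -> str:
    """Label a read by how many locations it aligned to."""
    if not hits:
        return "unlocatable"
    return "single" if len(hits) == 1 else "multiple"


# Toy data, purely illustrative.
REFERENCES = {"chrToy": "ACGTACGTTTGA"}
READS = {"r1": "TTGA", "r2": "ACGT", "r3": "GGGG"}

hits = align(READS, REFERENCES)
outcomes = {name: classify(found) for name, found in hits.items()}
print(outcomes)
```

Because one read can match several locations and one location can be hit by several reads, the result is naturally a many-to-many mapping rather than one position per read.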

Published in: Technology

  1. © 2015 MapR Technologies
  2. Let’s kick off a build while we talk… • Follow the instructions at https://github.com/allenday/spark-genome-alignment-demo • While that’s cooking… • What’s this all about?
  3. Alignment (diagram): DNA Reads + Reference Sequences → Aligned Reads → Downstream Applications…
  4. Alignment (diagram): DNA Reads + Reference Sequences → Align() → Aligned Reads → Downstream Applications…
  5. Possible Align() Outcomes (diagram): Unaligned DNA Reads + Reference Sequences → Align() → Single Location Reads, Multiple Location Reads, Unlocatable Reads
  6. Many-to-Many Relationship Between Reads and Locations • Read1 • Read2 • Read3 • Read4 • NULL • LocationA • LocationB • LocationC • LocationD • LocationA • NULL • LocationE
  7. Parallelizing Alignment (diagram): Unaligned DNA Reads → Split() → Part1, Part2, Part3 → Align() → Locations (per part) → Concat() → Sort() → Etc… → Aligned DNA Reads
  8. Using HPC+SAN Has Bottlenecks (SGE, PBS, Condor, etc.) (diagram): Part1, Part2, Part3 • Volume Read Bottleneck • Volume Write Bottleneck • Read & Write Bottleneck
  9. Using Spark Eliminates Bottlenecks (diagram): Split() → Align() → Concat() → Sort()
  10. How to Do It? FastQ => BAM
      Prepare Environment
      • Get Input Reference Sequences in FastA format (one-time step)
      • Index Reference Sequences with Bowtie (one-time step)
      Get Input Data
      • Get Input Unaligned Reads in FastQ format (per input data set)
      Do the Work
      • Convert Reads to ADAM Format with Spark (per input data set)
      • Use Spark Pipe to Align the Reads (per input data fragment)
      • Write the Alignments to SAM or BAM format (per input data set, optional)
  11. Let’s look at some code… • Let’s check on the build we kicked off earlier. • And the driver script: • cat $DEMO/bin/bowtie_pipe_single.scala • I said before that it’s possible to be 100% compatible with legacy tools by writing out to SAM or BAM. • You might not want to do this…
  12. Benefits of Keeping Data in Spark / ADAM • ADAM file sizes are about 20% smaller than BAM. • ADAM files are faster for common operations (filters, range queries). • ADAM files sort more quickly than `samtools sort …`. This can be a huge time savings.
  14. Some other commands are built in; take a look: • $ADAM_HOME/bin/adam-submit • $ADAM_HOME/bin/adam-submit count_kmers $DEMO/build/data/reads.sam /tmp/count_kmers 3 • head /tmp/count_kmers/part-00000 • All of this is easier without leaving the Spark shell • And it’s faster, too…
  15. And it’s faster, too… Using Spark Eliminates Bottlenecks (diagram): Split() → Align() → Concat() → Sort()
  16. Thanks! Questions? @allenday, @mapr • aday@mapr.com • linkedin.com/in/allenday
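
The Split() → Align() → Concat() → Sort() pattern from the slides, together with the "Use Spark Pipe to Align the Reads" step, can be sketched in plain Python without a cluster. Spark's RDD `pipe` operation streams each partition's elements to an external command's standard input, one per line, and turns the command's output lines into the resulting elements; the sketch imitates that with `subprocess`. It is not the demo's driver script. The "aligner" here is a hypothetical stand-in (a few lines of Python doing an exact-match lookup) rather than Bowtie, and the reference, reads, and function names are invented for illustration.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import List

REFERENCE = "ACGTACGTTTGA"  # toy reference sequence (assumption for this sketch)

# Hypothetical stand-in for an external aligner such as Bowtie: it reads
# "name<TAB>sequence" lines on stdin and writes "name<TAB>position" lines on
# stdout, where position is the first exact match or -1 if there is none.
ALIGNER_SOURCE = r"""
import sys
reference = sys.argv[1]
for line in sys.stdin:
    name, sequence = line.rstrip("\n").split("\t")
    print(name + "\t" + str(reference.find(sequence)))
"""


def split_reads(records: List[str], parts: int) -> List[List[str]]:
    """Split(): deal the records round-robin into `parts` partitions."""
    return [records[i::parts] for i in range(parts)]


def pipe_partition(partition: List[str], command: List[str]) -> List[str]:
    """Align(): feed one partition to an external process, collect its output lines."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in partition),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def run_pipeline(records: List[str], parts: int = 3) -> List[str]:
    """Split, align each partition in its own process, concatenate, then sort."""
    command = [sys.executable, "-c", ALIGNER_SOURCE, REFERENCE]
    partitions = split_reads(records, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        aligned_parts = list(pool.map(lambda p: pipe_partition(p, command), partitions))
    merged = [line for part in aligned_parts for line in part]  # Concat()
    return sorted(merged, key=lambda line: int(line.split("\t")[1]))  # Sort()


READS = ["r1\tTTGA", "r2\tACGT", "r3\tGGGG", "r4\tCGTT"]
aligned = run_pipeline(READS)
for line in aligned:
    print(line)
```

The external tool only needs to read from standard input and write to standard output, which is why a legacy command-line program can be dropped into the middle of the pipeline unchanged. In Spark the partitions would live on different machines, so no single shared volume has to serve every read and write.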
