Why is Bioinformatics 
(well, really, “genomics”) 
a Good Fit for Spark? 
Timothy Danford 
AMPLab
A One-Slide Introduction to Genomics
Bioinformatics computation is batch 
processing and workflows 
● Bioinformatics has a lot of 
“workflow engines” 
○ Galaxy, Taverna, Firehose, Zamboni, 
Queue, Luigi, bPipe 
○ bash scripts 
○ even make, fer cryin’ out loud 
○ a new one every day 
● Bioinformatics software 
development is still largely a 
research activity
State-of-the-Art infrastructure: 
shared filesystems, handwritten parallelism 
● Hand-written task creation 
● File formats instead of APIs or 
data models 
○ formats are poorly defined 
○ contain optional or 
redundant fields 
○ semantics are unclear 
● Workflow engines can’t take 
advantage of common 
parallelism between stages
So, why Spark?
Most of Genomics is 1-D Geometry
The rest is iterative evaluation of 
probabilistic models!
Spark RDDs and Partitioners allow 
declarative parallelization for genomics 
● Genomics computation 
is parallelized in a small, 
standard number of 
ways 
○ by position 
○ by sample 
● Declarative, flexible 
partitioning schemes 
are useful
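As a minimal sketch of what "declarative" partitioning means here, the two schemes above can each be written as a plain key function. The record layout, the field names, and the BIN_SIZE value are assumptions made for illustration; they are not taken from the slides or from ADAM.

```python
from collections import defaultdict

BIN_SIZE = 1_000_000  # hypothetical bin width in base pairs


def position_key(read):
    """Key for partitioning 'by position': contig plus a fixed-width bin."""
    return (read["contig"], read["start"] // BIN_SIZE)


def sample_key(read):
    """Key for partitioning 'by sample'."""
    return read["sample"]


def partition_by(records, key_fn):
    """Group records by key; a local stand-in for a cluster-wide partitioner."""
    parts = defaultdict(list)
    for record in records:
        parts[key_fn(record)].append(record)
    return dict(parts)


# Illustrative records only; not real data.
reads = [
    {"sample": "S1", "contig": "chr1", "start": 150},
    {"sample": "S2", "contig": "chr1", "start": 2_500_000},
    {"sample": "S1", "contig": "chr2", "start": 900_000},
    {"sample": "S2", "contig": "chr1", "start": 999_999},
]

by_position = partition_by(reads, position_key)
by_sample = partition_by(reads, sample_key)
```

Switching schemes means swapping the key function and nothing else. In PySpark the same idea would likely be expressed by keying an RDD with one of these functions and then calling partitionBy on it.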
Spark can easily express genomics primitives: 
join by genomic overlap 
1. Calculate disjoint 
regions based on left 
(blue) set 
2. Partition both sets by 
disjoint regions 
3. Merge-join within each 
partition 
4. (Optional) aggregation 
across joined pairs
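The four steps above can be sketched on a single machine as follows. This is an illustration under stated assumptions, not ADAM's implementation: intervals are half-open (start, end) tuples on one contig, all function names are hypothetical, and the per-partition join is a simple sorted scan rather than an optimized merge.

```python
from collections import Counter, defaultdict


def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def disjoint_regions(left):
    """Step 1: merge overlapping left intervals into sorted, disjoint regions."""
    regions = []
    for start, end in sorted(left):
        if regions and start < regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def partition(intervals, regions):
    """Step 2: assign each interval to every region it overlaps."""
    parts = defaultdict(list)
    for interval in intervals:
        for i, region in enumerate(regions):
            if overlaps(interval, region):
                parts[i].append(interval)
    return parts


def join_partition(lefts, rights):
    """Step 3: join within one partition by scanning both sides in sorted order."""
    pairs = []
    rights = sorted(rights)
    for l in sorted(lefts):
        for r in rights:
            if r[0] >= l[1]:
                break  # rights are sorted by start, so no later one overlaps l
            if overlaps(l, r):
                pairs.append((l, r))
    return pairs


def overlap_join(left, right):
    regions = disjoint_regions(left)
    left_parts = partition(left, regions)
    right_parts = partition(right, regions)
    pairs = []
    for i in range(len(regions)):
        pairs.extend(join_partition(left_parts[i], right_parts[i]))
    return pairs


left = [(0, 10), (5, 15), (30, 40)]
right = [(8, 12), (14, 31), (50, 60)]
pairs = overlap_join(left, right)

# Step 4 (optional): aggregate, here counting overlaps per left interval.
counts = Counter(l for l, _ in pairs)
```

Because every left interval falls inside exactly one disjoint region, a right interval that spans two regions is copied to both partitions without producing duplicate pairs.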
ADAM is Genomics + Spark 
● A rewrite of core bioinformatics tools and algorithms in Spark 
● Combines three 
technologies 
○ Spark 
○ Parquet 
○ Avro 
● Apache 2-licensed 
● Started at the AMPLab 
http://bdgenomics.org/
Avro and Parquet are just as critical to 
ADAM as Spark 
● Avro to define data models 
● Parquet for serialization format 
● Still need to answer design 
questions 
○ how wide are the schemas? 
○ how much do we follow existing 
formats? 
○ how do we carry through projections?
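To make the schema-width and projection questions concrete, here is a small sketch of a record schema written in Avro's JSON schema syntax, plus a projection that keeps only some fields. The record name and every field name are hypothetical and chosen only for illustration; they are not ADAM's actual schemas.

```python
import json

# Hypothetical record schema in Avro's JSON syntax (names are illustrative).
READ_SCHEMA = {
    "type": "record",
    "name": "Read",
    "fields": [
        {"name": "contig", "type": "string"},
        {"name": "start", "type": "long"},
        {"name": "end", "type": "long"},
        {"name": "sequence", "type": "string"},
        {"name": "mappingQuality", "type": ["null", "int"], "default": None},
    ],
}


def projected_schema(schema, fields):
    """Return a narrower schema containing only the requested fields."""
    keep = [f for f in schema["fields"] if f["name"] in fields]
    return {**schema, "fields": keep}


def project(record, fields):
    """Apply the same projection to a single record."""
    return {name: record[name] for name in fields}


wanted = ["contig", "start", "end"]
narrow = projected_schema(READ_SCHEMA, wanted)

record = {
    "contig": "chr1",
    "start": 100,
    "end": 200,
    "sequence": "ACGT",
    "mappingQuality": None,
}
narrow_record = project(record, wanted)

schema_json = json.dumps(narrow)  # the form in which a schema would be stored
```

With a columnar format such as Parquet, a projection like this lets a reader skip the columns it does not need, which is why a wide schema can still be cheap to query. The open question on the slide is how such projections should propagate through later stages of a pipeline.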
Still need to convince bioinformaticians to 
rewrite their software! 
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
Still need to convince bioinformaticians to 
rewrite their software! 
● A single piece of a 
single filtering stage 
for a somatic variant 
caller 
● “11-base-pair window 
centered on a candidate 
mutation” actually 
turns out to be 
optimized for a 
particular file format 
and sort order 
Cibulskis et al. Nature Biotechnology 31, 213–219 (2013)
The Future: 
Distributed and Incremental? 
● Today: 5k samples × 20 Gb / sample 
● Tomorrow: 1M+ samples × 200+ Gb / sample? 
● More and more analysis is aggregative 
○ joint variant calling, 
○ panels of normal samples, 
○ collective variant annotation 
● And “data collection” will never be finished
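A quick back-of-the-envelope calculation shows the scale implied by the "today" and "tomorrow" figures above. It uses the slide's numbers and its "Gb" unit as written, treats the "tomorrow" values as lower bounds, and assumes decimal prefixes (1 Tb = 1,000 Gb).

```python
def total_gb(samples, gb_per_sample):
    """Total data volume in the slide's 'Gb' unit."""
    return samples * gb_per_sample


today_gb = total_gb(5_000, 20)            # 5k samples at 20 Gb each
tomorrow_gb = total_gb(1_000_000, 200)    # lower bound: 1M samples at 200 Gb each

today_tb = today_gb / 1_000               # decimal prefixes assumed
tomorrow_pb = tomorrow_gb / 1_000_000
growth = tomorrow_gb // today_gb

print(f"today:    {today_tb:,.0f} Tb")
print(f"tomorrow: {tomorrow_pb:,.0f} Pb or more")
print(f"growth:   {growth:,}x or more")
```

That is roughly 100 Tb today against at least 200 Pb tomorrow, a factor of 2,000 or more.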
Acknowledgements 
Matt Massie (AMPLab) 
Frank Nothaft (AMPLab) 
Carl Yeksigian (DataStax) 
Anthony Philippakis (Broad Institute) 
Jeff Hammerbacher (Cloudera / Mt. Sinai) 
Thank you! 
(questions?)
