ADAM—Spark Summit, 2014

ADAM: Fast, Scalable Genome Analysis

Frank Austin Nothaft
AMPLab, University of California, Berkeley, @fnothaft

with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson

https://github.com/bigdatagenomics
http://www.bdgenomics.org
Problem
• Whole genome files are large
• Biological systems are complex
• Population analysis requires petabytes of data
• Analysis time is often a matter of life and death
Whole Genome Data Sizes

Tool      Input                     Pipeline Stage     Output
SNAP      1GB Fasta + 150GB Fastq   Alignment          250GB BAM
ADAM      250GB BAM                 Pre-processing     200GB ADAM
Avocado   200GB ADAM                Variant Calling    10MB ADAM

Variants found at about 1 in 1,000 loci
Shredded Book Analogy
Dickens accidentally shreds the first printing of A Tale of Two Cities
Text printed on 5 long spools
• How can he reconstruct the text?
  – 5 copies x 138,656 words / 5 words per fragment = 138k fragments (worked below)
  – The short fragments from every copy are mixed together
  – Some fragments are identical
[Figure: five shredded copies of the opening sentence of A Tale of Two Cities ("It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …"), first shown with fragments interleaved out of order, then reassembled into the original sentence by overlapping identical fragments.]
Slide credit to Michael Schatz http://schatzlab.cshl.edu/
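
The fragment arithmetic from the bullet above, worked in a few lines of Scala (the word count is the figure quoted on the slide):

```scala
val copies = 5            // five shredded printings of the book
val wordsPerCopy = 138656 // word count of A Tale of Two Cities
val wordsPerFragment = 5  // each fragment holds five words

// 5 copies x 138,656 words / 5 words per fragment = 138,656 fragments (~138k)
val fragments = copies * wordsPerCopy / wordsPerFragment
println(fragments) // 138656
```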
What is ADAM?
• File format: a columnar file format that allows efficient parallel access to genomes
• API: an interface for transforming, analyzing, and querying genomic data (sketched below)
• CLI: a handy toolkit for quickly processing genomes
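
A minimal sketch of the API layer. The record type and the in-memory "load" below are hypothetical stand-ins (ADAM's real records are Avro-defined schemas loaded from Parquet), but the Spark RDD transforms show the intended working style:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type, standing in for ADAM's Avro-defined read schema.
case class AlignedRead(contig: String, start: Long, sequence: String)

object ReadsPerContig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("adam-api-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // ADAM loads reads from Parquet into a Spark RDD; this sketch fakes the
    // load with a small in-memory collection so it stays self-contained.
    val reads = sc.parallelize(Seq(
      AlignedRead("chrom20", 100L, "TCGA"),
      AlignedRead("chrom20", 200L, "GAAT"),
      AlignedRead("chrom21", 300L, "CCGAT")))

    // A typical transform/query: count aligned reads per contig.
    reads.map(r => (r.contig, 1L))
      .reduceByKey(_ + _)
      .collect()
      .foreach { case (contig, n) => println(s"$contig\t$n") }

    sc.stop()
  }
}
```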
Design Goals
• Develop a processing pipeline that enables efficient, scalable use of clusters/clouds
• Provide a data format that has efficient parallel/distributed access across platforms
• Enhance the semantics of the data and allow more flexible data access patterns
Implementation Overview
• 25K lines of Scala code
• 100% Apache-licensed open source
• 18 contributors from 6 institutions
• Working towards a production-quality release in late 2014
ADAM Stack
Physical
  ‣ Commodity hardware
  ‣ Cloud systems - Amazon, GCE, Azure
File/Block
  ‣ Hadoop Distributed Filesystem
  ‣ Local filesystem
Record/Split
  ‣ Schema-driven records w/ Apache Avro
  ‣ Store and retrieve records using Parquet
  ‣ Read BAM files using Hadoop-BAM
In-Memory RDD
  ‣ Transform records using Apache Spark
  ‣ Query with SQL using Shark
  ‣ Graph processing with GraphX
  ‣ Machine learning using MLBase
Parquet
• OSS created by Twitter and Cloudera, based on Google Dremel
• Columnar file format:
  • Limits I/O to only the data that is needed
  • Compresses very well - ADAM files are 5-25% smaller than BAM files, without loss of data
  • Fast scans - load only the columns you need, e.g. scan a read flag on a whole-genome, high-coverage file in less than a minute
Three example read records: (chrom20, TCGA, 4M), (chrom20, GAAT, 4M1D), (chrom20, CCGAT, 5M)

Column oriented: chrom20 chrom20 chrom20 | TCGA GAAT CCGAT | 4M 4M1D 5M
Row oriented:    chrom20 TCGA 4M | chrom20 GAAT 4M1D | chrom20 CCGAT 5M

Predicate (which records to keep) and projection (which columns to load) together determine the data actually read from disk; see the sketch below.
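
To make predicate and projection concrete, here is a hedged sketch using Spark's generic DataFrame Parquet reader (a later API than the 2014-era Shark mentioned above, not ADAM's own loaders); the path and column names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object ParquetScanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-scan-sketch")
      .master("local[*]")
      .getOrCreate()

    // Projection: only the selected columns are read from disk.
    // Predicate: the filter is pushed down into the Parquet reader, which can
    // skip whole row groups whose column statistics rule out a match.
    val chrom20Flags = spark.read
      .parquet("/data/reads.adam")     // hypothetical path
      .select("contig", "readMapped")  // hypothetical column names
      .filter("contig = 'chrom20'")

    println(chrom20Flags.count())
    spark.stop()
  }
}
```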
Cloud Optimizations
• Working on optimizations for loading Parquet directly from S3
• Building tools for changing cluster size as spot prices fluctuate
• Will separate this code out for broader community use
Scaling Genomics: BQSR
• DNA sequencers read 2% of the sequence incorrectly
• Per base, the sequencer estimates the likelihood L(base is correct)
• However, these estimates are poor, because sequencers miss correlated errors
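
These per-base estimates are Phred-scaled qualities; a quick sketch of the standard conversion (standard sequencing math, not ADAM-specific):

```scala
object PhredSketch {
  // Phred scale: Q = -10 * log10(P_error), so P_error = 10^(-Q / 10).
  def phredToErrorProb(q: Int): Double = math.pow(10.0, -q / 10.0)
  def errorProbToPhred(p: Double): Int = math.round(-10.0 * math.log10(p)).toInt

  def main(args: Array[String]): Unit = {
    // A reported Q30 base claims a 1-in-1,000 error rate...
    println(phredToErrorProb(30)) // 0.001
    // ...but if bases with some covariate actually mismatch 2% of the time,
    // their recalibrated quality is only Q17.
    println(errorProbToPhred(0.02)) // 17
  }
}
```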
[Figure: Empirical Error Rate]
Spark BQSR Implementation
• Broadcast a 3 GB table of known variants, used for masking
• Break reads down to bases and map bases to covariates
• Calculate empirical values per covariate
• Broadcast the observations and apply them across reads (see the sketch below)
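
A hedged sketch of this broadcast/aggregate/broadcast pattern on Spark. The types and covariate space are simplified stand-ins (ADAM's real Avro schemas and covariates - read group, reported quality, cycle, context - are richer), but the dataflow follows the four bullets above:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BqsrSketch {
  // Hypothetical, simplified record types for illustration only.
  case class BaseObs(covariate: (String, Int), isMismatch: Boolean, position: Long)
  case class Read(bases: Seq[BaseObs]) {
    // Stub: a real implementation rewrites the read's quality scores.
    def recalibrated(table: Map[(String, Int), Int]): Read = this
  }

  def bqsr(sc: SparkContext, reads: RDD[Read], knownSites: Set[Long]): RDD[Read] = {
    // 1. Broadcast the known-variant mask once, so executors can drop sites
    //    where a mismatch reflects real variation rather than sequencer error.
    val mask = sc.broadcast(knownSites)

    // 2. Break reads down to bases and key each unmasked base by covariate.
    val observations = reads
      .flatMap(_.bases)
      .filter(b => !mask.value.contains(b.position))
      .map(b => (b.covariate, (if (b.isMismatch) 1L else 0L, 1L)))

    // 3. Sum (mismatches, observations) per covariate and convert the
    //    empirical mismatch rate into a Phred quality (+1 pseudocounts).
    val table = observations
      .reduceByKey { case ((m1, n1), (m2, n2)) => (m1 + m2, n1 + n2) }
      .mapValues { case (m, n) =>
        math.round(-10.0 * math.log10((m + 1.0) / (n + 1.0))).toInt }
      .collectAsMap()
      .toMap

    // 4. Broadcast the small recalibration table and apply it to every read.
    val recal = sc.broadcast(table)
    reads.map(_.recalibrated(recal.value))
  }
}
```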
Future Work
• Pushing hard towards a production release
• Plan to release Python (possibly R) bindings
• Work on interoperability with the Global Alliance for Genomics and Health API (http://genomicsandhealth.org/)
Call for contributions
• As an open source project, we welcome contributions
• We maintain a list of open enhancements in our GitHub issue tracker
• Enhancements tagged with “Pick me up!” don’t require a genomics background
• GitHub: https://www.github.com/bdgenomics
• We’re also looking for two full-time engineers… see Matt Massie!
Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• The Broad Institute: Chris Hartl
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And other open source contributors!
Acknowledgements
This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco and Yahoo!.