Scaling up genomic 
analysis with ADAM 
Frank Austin Nothaft, UC Berkeley AMPLab 
fnothaft@berkeley.edu, @fnothaft 
11/20/2014
Credit: Matt Massie & NHGRI
The Sequencing Abstraction 
It was the best of times, it was the worst of times… 
Reads (random substrings): "the worst of", "It was the", "the best of", "worst of times", "times, it was", "best of times", "was the worst" 
• Humans have 46 chromosomes, and each chromosome looks like a long string 
• We get randomly distributed substrings, and want to reassemble the original, whole string 
Metaphor borrowed from Michael Schatz
Genomics = Big Data 
• A sequencing run produces >100 GB of raw data 
• We want to process thousands of samples at once to improve statistical power 
• Current pipelines take about a week to run and are not horizontally scalable
How do we process a 
genome?
What’s our goal? 
• The human genome is 3.3B letters long, but our reads are only 50-250 letters long 
• The sequence of the average human genome is known 
• Insight: each human genome differs from the average at only about 1 in 1,000 positions, so we can align short reads to the average genome and compute a diff
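The "align, then diff" idea can be sketched on the Dickens sentence used in these slides. This is only a toy illustration: the function names, the brute-force scan, and the one-mismatch tolerance are assumptions made for the example, and real aligners use indexes and handle insertions and deletions.

```python
# Toy sketch of "align short reads to a reference, then compute a diff".
# All names are hypothetical; this is not how a production aligner works.

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def align(read, reference, max_mismatches=1):
    """Return the offset where `read` best matches `reference`, or None."""
    best_pos, best_dist = None, max_mismatches + 1
    for pos in range(len(reference) - len(read) + 1):
        dist = hamming(read, reference[pos:pos + len(read)])
        if dist < best_dist:  # strict "<" keeps the leftmost best placement
            best_pos, best_dist = pos, dist
    return best_pos

def diff(reads, reference):
    """Collect (position, reference_char, read_char) for every mismatch."""
    variants = set()
    for read in reads:
        pos = align(read, reference)
        if pos is None:
            continue  # the read could not be placed within the tolerance
        for i, ch in enumerate(read):
            if ch != reference[pos + i]:
                variants.add((pos + i, reference[pos + i], ch))
    return sorted(variants)

reference = "It was the best of times, it was the worst of times"
# One read carries a single-letter difference ("timez").
reads = ["It was the", "the best of", "best of timez", "was the worst"]
variants = diff(reads, reference)
print(variants)
```

Only the positions where a read disagrees with the reference are reported, which is why the diff is far smaller than the reads themselves.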
Align Reads 
It was the best of times, it was the worst of times… 
[Animation: the reads are placed one at a time at their matching positions along the sentence] 
It was the 
the best of 
best of times 
times, it was 
was the worst 
the worst of 
worst of times
Assemble Reads 
It was the best of times, it was the worst of times… 
[Animation: overlapping reads are merged step by step] 
It was the / the best of / best of times / times, it was / was the worst / the worst of / worst of times 
It was the best of times, it was / was the worst / the worst of / worst of times 
It was the best of times, it was the worst / worst of times 
It was the best of times, it was the worst of times
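The step-by-step merging on the Assemble Reads slides can be approximated with a greedy overlap merge. This is a minimal sketch under stated assumptions, not the method used by any particular assembler: the function names and the minimum overlap of 3 characters are hypothetical. The example uses longer reads than the slides do, because repeated phrases such as "it was the" and "of times" make greedy merging of the short reads ambiguous.

```python
# Greedy overlap merging: repeatedly join the two reads with the longest
# suffix/prefix overlap. Hypothetical names; an illustration only.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are wholly contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no remaining overlaps; return the separate pieces
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads

reads = ["the worst of times", "It was the best of",
         "it was the worst", "best of times, it was"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these reads every overlap is unique, so the merge order does not matter and a single string comes back.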
Overall Pipeline Structure 
The end-to-end pipeline takes ~120 hours 
The stages take ~100 hours; ADAM works here 
From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
Making Genomics 
Horizontally Scalable
Key Observations 
• Current genomics pipelines are I/O limited 
• Most genomics algorithms can be formulated as either data-parallel or graph-parallel computations 
• Genomics is heavy on iteration and pipelining; the data access pattern is write once, read many times 
• High-coverage, whole-genome data (>220 GB) will become the main dataset for human genetics
ADAM Principles 
• Use schema as “narrow waist” 
• Columnar data representation + 
in-memory computing eliminates 
disk bandwidth bottleneck 
• Minimize data movement: send 
code to data 
Stack, top to bottom (layer: what provides it): 
Application: Transformations 
Presentation: Enriched Models 
Evidence Access: MapReduce/DBMS 
Schema: Data Models 
Materialized Data: Columnar Storage 
Data Distribution: Parallel FS/Sharding 
Physical Storage: Disk
Data Independence 
• Many current genomics systems require data to be stored and processed in sorted order 
• This is an abstraction inversion! 
• A narrow waist at the schema forces processing to be abstracted from the data, and the data to be abstracted from the disk 
• Do tricks at the processing level (fast coordinate-system joins) to provide the necessary programming abstractions
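One way to picture a coordinate-system join that does not depend on sort order is a binned hash join over genomic intervals. The slides do not describe how ADAM implements its joins, so the following is a generic sketch: the names, the half-open `(contig, start, end)` tuples, and the bin width are all assumptions.

```python
# Generic binned overlap join on genomic intervals; input order is irrelevant.
# Hypothetical names and bin size; not ADAM's implementation.
from collections import defaultdict

BIN_SIZE = 1000  # assumed bin width, in bases

def bins(start, end):
    """Bins touched by the half-open interval [start, end)."""
    return range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1)

def overlaps(a, b):
    """True if two (contig, start, end) half-open intervals overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def region_join(left, right):
    """Pair every left interval with every overlapping right interval."""
    index = defaultdict(list)
    for rhs in right:
        for b in bins(rhs[1], rhs[2]):
            index[(rhs[0], b)].append(rhs)
    pairs = set()  # a pair may be found in several bins; keep it once
    for lhs in left:
        for b in bins(lhs[1], lhs[2]):
            for rhs in index.get((lhs[0], b), ()):
                if overlaps(lhs, rhs):
                    pairs.add((lhs, rhs))
    return sorted(pairs)

# Deliberately unsorted reads joined against target regions.
reads = [("chr2", 50, 150), ("chr1", 1990, 2090), ("chr1", 100, 200)]
regions = [("chr1", 150, 2000), ("chr2", 500, 600)]
joined = region_join(reads, regions)
print(joined)
```

Because the join keys are derived from coordinates rather than file position, the caller gets a coordinate-aware abstraction without the storage layer having to keep records sorted.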
Data Format 
• Genomics algorithms frequently 
access global metadata 
• Schema is fully denormalized, 
allows O(1) access to metadata 
• Make all fields nullable to allow for 
arbitrary column projections 
• Avro enables literate 
programming 
record AlignmentRecord { 
  // Alignment position and read data: nullable, defaulting to null 
  union { null, Contig } contig = null; 
  union { null, long } start = null; 
  union { null, long } end = null; 
  union { null, int } mapq = null; 
  union { null, string } readName = null; 
  union { null, string } sequence = null; 
  union { null, string } mateReference = null; 
  union { null, long } mateAlignmentStart = null; 
  union { null, string } cigar = null; 
  union { null, string } qual = null; 
  union { null, string } recordGroupName = null; 
  // Trimming counts and flags: nullable, with non-null defaults 
  union { int, null } basesTrimmedFromStart = 0; 
  union { int, null } basesTrimmedFromEnd = 0; 
  union { boolean, null } readPaired = false; 
  union { boolean, null } properPair = false; 
  union { boolean, null } readMapped = false; 
  union { boolean, null } mateMapped = false; 
  union { boolean, null } firstOfPair = false; 
  union { boolean, null } secondOfPair = false; 
  union { boolean, null } failedVendorQualityChecks = false; 
  union { boolean, null } duplicateRead = false; 
  union { boolean, null } readNegativeStrand = false; 
  union { boolean, null } mateNegativeStrand = false; 
  union { boolean, null } primaryAlignment = false; 
  union { boolean, null } secondaryAlignment = false; 
  union { boolean, null } supplementaryAlignment = false; 
  // Optional alignment annotations 
  union { null, string } mismatchingPositions = null; 
  union { null, string } origQual = null; 
  union { null, string } attributes = null; 
  // Record group metadata, denormalized onto every record 
  union { null, string } recordGroupSequencingCenter = null; 
  union { null, string } recordGroupDescription = null; 
  union { null, long } recordGroupRunDateEpoch = null; 
  union { null, string } recordGroupFlowOrder = null; 
  union { null, string } recordGroupKeySequence = null; 
  union { null, string } recordGroupLibrary = null; 
  union { null, int } recordGroupPredictedMedianInsertSize = null; 
  union { null, string } recordGroupPlatform = null; 
  union { null, string } recordGroupPlatformUnit = null; 
  union { null, string } recordGroupSample = null; 
  // Contig of the mate read 
  union { null, Contig } mateContig = null; 
}
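The two design points on this slide, nullable fields and denormalized metadata, can be illustrated with a small Python stand-in for the record. This is a sketch under assumptions: the dataclass mirrors only a handful of the schema's fields, `contig` is simplified to a string, and `project` is a hypothetical helper rather than an Avro or ADAM API.

```python
# Stand-in for a few AlignmentRecord fields, showing why nullable fields
# allow arbitrary projections. Hypothetical helper names throughout.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignmentRecord:
    contig: Optional[str] = None  # simplification: the schema uses a Contig record
    start: Optional[int] = None
    mapq: Optional[int] = None
    sequence: Optional[str] = None
    readMapped: Optional[bool] = False
    recordGroupSample: Optional[str] = None  # denormalized metadata

def project(row, wanted):
    """Build a record from only the `wanted` columns; the rest keep defaults."""
    return AlignmentRecord(**{k: v for k, v in row.items() if k in wanted})

row = {"contig": "chr1", "start": 100, "mapq": 60, "sequence": "ACGT",
       "readMapped": True, "recordGroupSample": "sample-1"}

# Only two columns are materialized; every other field is simply null.
rec = project(row, {"start", "readMapped"})
print(rec)

# With denormalized metadata, the sample name is a plain field read.
full = project(row, set(row))
print(full.recordGroupSample)
```

Because any field may be absent, a reader can request any subset of columns and still produce a valid record, and per-record metadata needs no separate lookup table.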
Parquet 
• ASF Incubator project, based on 
Google Dremel 
• http://www.parquet.io 
• High performance columnar 
store with support for projections 
and push-down predicates 
• 3 layers of parallelism: file/row group, column chunk, and page 
Image from Parquet format definition: https://github.com/Parquet/parquet-format
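Projections and push-down predicates can be simulated in plain Python to show why they reduce I/O. This sketch does not use the Parquet API: the row-group layout, the per-group min/max statistics, and all function names are assumptions made for the illustration, and it assumes the filtered column has no null values.

```python
# Simulation of columnar row groups with min/max statistics.
# Hypothetical layout and names; not the Parquet library or file format.

def write_row_groups(rows, columns, group_size):
    """Split rows into row groups and store each column as its own chunk."""
    groups = []
    for i in range(0, len(rows), group_size):
        chunk = rows[i:i + group_size]
        data = {c: [r[c] for r in chunk] for c in columns}
        stats = {c: (min(v), max(v)) for c, v in data.items()}
        groups.append({"data": data, "stats": stats})
    return groups

def scan(groups, project, column, lo, hi):
    """Return projected rows with lo <= row[column] <= hi.

    Row groups whose statistics exclude the range are skipped entirely
    (predicate pushdown); only the `project` columns are touched.
    """
    out, groups_read = [], 0
    for g in groups:
        gmin, gmax = g["stats"][column]
        if gmax < lo or gmin > hi:
            continue  # the whole row group is skipped without reading it
        groups_read += 1
        for idx, value in enumerate(g["data"][column]):
            if lo <= value <= hi:
                out.append(tuple(g["data"][c][idx] for c in project))
    return out, groups_read

rows = [{"start": i * 100, "mapq": 60 - (i % 3), "sequence": "ACGT"}
        for i in range(8)]
groups = write_row_groups(rows, ["start", "mapq", "sequence"], group_size=4)

# The bulky "sequence" column is never read, and one of two groups is skipped.
result, groups_read = scan(groups, ["start", "mapq"], "start", 450, 650)
print(result, groups_read)
```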
Access to Remote Data 
• For genomics, we often have a very large dataset of which we only want to analyze a part 
• This dataset might be stored in S3 or an equivalent block store 
• Minimize data movement by allowing Parquet to support predicate pushdown/projections into S3 
• Work is in progress, found at https://github.com/bigdatagenomics/adam/tree/multi-loader
Performance 
• Reduced pipeline time 
from 100 hrs to ~1hr 
• Linear speedup through 
128 nodes, when 
processing 234GB of data 
• For flagstat, columnar 
projection leads to a 5x 
speedup
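As a rough back-of-the-envelope check, the figures quoted in these slides can be combined to estimate the end-to-end effect. This assumes that the 100 hours reduced here are the same ~100 hours of stages identified on the pipeline slide, and that the remaining ~20 hours of the ~120-hour pipeline are unchanged; neither assumption is stated in the slides, and the function name is hypothetical.

```python
# Amdahl-style estimate using the approximate figures from the slides.
# Assumes the non-accelerated part of the pipeline keeps its original runtime.

def end_to_end(total_hours, accelerated_hours, new_hours):
    """Return (new total runtime, overall speedup)."""
    remaining = total_hours - accelerated_hours
    new_total = remaining + new_hours
    return new_total, total_hours / new_total

new_total, speedup = end_to_end(total_hours=120, accelerated_hours=100, new_hours=1)
print(f"{new_total} h end to end, {speedup:.1f}x overall")
```

Under these assumptions the stages outside ADAM's scope would dominate the remaining runtime.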
ADAM Status 
• Apache 2 licensed OSS 
• 25 contributors across 10 institutions 
• Pushing for production 1.0 release towards end of year 
• Working with GA4GH to use concepts from ADAM to 
improve broader genomics data management techniques
Acknowledgements 
• UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos 
Kozanitis, Dave Patterson, Anthony Joseph 
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael 
Linderman, Jeff Hammerbacher 
• GenomeBridge: Timothy Danford, Carl Yeksigian 
• The Broad Institute: Chris Hartl 
• Cloudera: Uri Laserson 
• Microsoft Research: Jeremy Elson, Ravi Pandya 
• And other open source contributors, including Michael Heuer, Neil 
Ferguson, Andy Petrella, Xavier Tordoir!

Scaling up genomic analysis with ADAM

  • 1. Scaling up genomic analysis with ADAM Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft 11/20/2014
  • 3. The Sequencing Abstraction It was the best of times, it was the worst of times… Reads: the worst of / It was the / the best of / worst of times / times, it was / best of times / was the worst • Humans have 46 chromosomes, and each chromosome looks like a long string • We get randomly distributed substrings and want to reassemble the original, whole string Metaphor borrowed from Michael Schatz
  • 4. Genomics = Big Data • Sequencing run produces >100 GB of raw data • Want to process 1,000’s of samples at once to improve statistical power • Current pipelines take about a week to run and are not horizontally scalable
  • 5. How do we process a genome?
  • 6. What’s our goal? • Human genome is 3.3B letters long, but our reads are only 50-250 letters long • Sequence of the average human genome is known • Insight: Each human genome only differs at 1 in 1000 positions, so we can align short reads to average genome, and compute diff
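The figures on this slide can be turned into a quick back-of-the-envelope calculation. The sketch below uses only the numbers quoted above (3.3B letters, a difference at 1 in 1000 positions, reads of 50-250 letters); the function names are illustrative and not part of ADAM.

```python
GENOME_LENGTH = 3_300_000_000   # letters, figure taken from the slide
DIFFERS_ONE_IN = 1000           # one differing position per 1000, from the slide


def expected_differences(genome_length, one_in):
    """Approximate number of positions that differ from the reference."""
    return genome_length // one_in


def reads_to_cover_once(genome_length, read_length):
    """Minimum number of non-overlapping reads needed to span the genome once."""
    return -(-genome_length // read_length)  # ceiling division


if __name__ == "__main__":
    print("expected differing positions:",
          expected_differences(GENOME_LENGTH, DIFFERS_ONE_IN))
    for read_length in (50, 250):
        print("reads of length", read_length, "to cover the genome once:",
              reads_to_cover_once(GENOME_LENGTH, read_length))
```

Real sequencing runs cover each position many times over, so the read counts here are lower bounds for a single pass, not estimates of a real run.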
  • 7. Align Reads It was the best of times, it was the worst of times… best of times was the worst It was the the best of times, it was the worst of worst of times
  • 8. Align Reads It was the best of times, it was the worst of times… It was the the best of best of times was the worst times, it was the worst of worst of times
  • 11. Align Reads It was the best of times, it was the worst of times… It was the the best of times, it was the worst of best of times was the worst worst of times
  • 12. Align Reads It was the best of times, it was the worst of times… It was the the best of times, it was the worst of best of times worst of times was the worst
  • 14. Align Reads It was the best of times, it was the worst of times… It was the the best of times, it was the worst of worst of times best of times was the worst
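The alignment animation above can be imitated with a toy aligner. This is a minimal sketch that assumes reads match the reference exactly; real aligners tolerate mismatches and use indexes, and nothing here reflects how ADAM or any production aligner is implemented. The names `find_all` and `align_reads` are made up for illustration.

```python
REFERENCE = "It was the best of times, it was the worst of times"

READS = [
    "best of times", "was the worst", "It was the", "the best of",
    "times, it was", "the worst of", "worst of times",
]


def find_all(reference, read):
    """Return every position at which `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def align_reads(reference, reads):
    """Return (position, read) pairs sorted by position on the reference."""
    alignments = []
    for read in reads:
        for position in find_all(reference, read):
            alignments.append((position, read))
    return sorted(alignments)


if __name__ == "__main__":
    for position, read in align_reads(REFERENCE, READS):
        # Indent each read by its position so it lines up under the reference.
        print(" " * position + read)
```

A short read such as "was the" occurs twice in the sentence, which is the toy version of the ambiguity that repeated regions cause in a real genome.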
  • 15. Assemble Reads It was the best of times, it was the worst of times… It was the the best of times, it was the worst of worst of times best of times was the worst
  • 16. Assemble Reads It was the best of times, it was the worst of times… It was the best of times, it was the worst of worst of times best of times was the worst
  • 17. Assemble Reads It was the best of times, it was the worst of times… It was the best of times, it was was the worst the worst of worst of times
  • 18. Assemble Reads It was the best of times, it was the worst of times… It was the best of times, it was the worst the worst of worst of times
  • 19. Assemble Reads It was the best of times, it was the worst of times… It was the best of times, it was the worst of worst of times
  • 20. Assemble Reads It was the best of times, it was the worst of times… It was the best of times, it was the worst of times
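Once reads carry a position, assembling them reduces to stacking the reads and picking a letter per position. The sketch below takes a majority vote at each position; the alignment positions are hard-coded from exact matching against the slide's sentence, and the function name `consensus` is an assumption for illustration rather than an ADAM API.

```python
from collections import Counter

# (position, read) pairs obtained by exact matching against the slide's sentence.
ALIGNMENTS = [
    (0, "It was the"),
    (7, "the best of"),
    (11, "best of times"),
    (19, "times, it was"),
    (29, "was the worst"),
    (33, "the worst of"),
    (37, "worst of times"),
]


def consensus(alignments, gap="?"):
    """Build a consensus string by majority vote at every covered position."""
    length = max(position + len(read) for position, read in alignments)
    votes = [Counter() for _ in range(length)]
    for position, read in alignments:
        for offset, letter in enumerate(read):
            votes[position + offset][letter] += 1
    # Positions that no read covers are marked with the gap character.
    return "".join(v.most_common(1)[0][0] if v else gap for v in votes)


if __name__ == "__main__":
    print(consensus(ALIGNMENTS))
```

Because every position is covered by at least one read, the reconstruction has no gaps; overlapping reads simply add extra votes for the same letter.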
  • 21. Overall Pipeline Structure From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
  • 22. Overall Pipeline Structure End to end pipeline takes ~120 hours The stages take ~100 hours; ADAM works here From “GATK Best Practices”, https://www.broadinstitute.org/gatk/guide/best-practices
  • 24. Key Observations • Current genomics pipelines are I/O limited • Most genomics algorithms can be formulated as either data-parallel or graph-parallel computation • Genomics is heavy on iteration/pipelining; the data access pattern is write once, read many times • High-coverage whole genomes (>220 GB) will become the main dataset for human genetics
  • 25. ADAM Principles • Use schema as “narrow waist” • Columnar data representation + in-memory computing eliminates disk bandwidth bottleneck • Minimize data movement: send code to data Application Transformations Presentation Enriched Models Evidence Access MapReduce/DBMS Schema Data Models Materialized Data Columnar Storage Data Distribution Parallel FS/Sharding Physical Storage Disk
  • 26. Data Independence • Many current genomics systems require data to be stored and processed in sorted order • This is an abstraction inversion! • Narrow waist at schema forces processing to be abstract from data, data to be abstract from disk • Do tricks at the processing level (fast coordinate-system joins) to give necessary programming abstractions
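To illustrate what a coordinate-system join without a sorted-order requirement might look like, here is a small sketch. It hashes each region into fixed-width bins and joins on shared bins; this is one common technique and is not claimed to be ADAM's implementation. The bin width, the half-open `(contig, start, end)` tuples, and the function names are all assumptions.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width, chosen only for this example


def overlaps(a, b):
    """True if two half-open (contig, start, end) regions overlap."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bins(region, bin_size=BIN_SIZE):
    """Keys of every bin that a non-empty region touches."""
    contig, start, end = region
    return [(contig, b) for b in range(start // bin_size, (end - 1) // bin_size + 1)]


def region_join(left, right, bin_size=BIN_SIZE):
    """Return all overlapping (left, right) pairs; inputs may be in any order."""
    index = defaultdict(list)
    for region in right:
        for key in bins(region, bin_size):
            index[key].append(region)
    pairs = set()  # a set removes pairs found through more than one shared bin
    for region in left:
        for key in bins(region, bin_size):
            for candidate in index.get(key, ()):
                if overlaps(region, candidate):
                    pairs.add((region, candidate))
    return sorted(pairs)


if __name__ == "__main__":
    reads = [("chr1", 2500, 2600), ("chr1", 100, 200), ("chr2", 150, 250)]
    targets = [("chr1", 0, 1000), ("chr1", 1000, 3000)]
    for pair in region_join(reads, targets):
        print(pair)
```

Neither input needs to be sorted, which is the point of the slide: ordering becomes a concern of the processing layer rather than a constraint on how the data is stored.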
  • 27. Data Format • Genomics algorithms frequently access global metadata • Schema is fully denormalized, allows O(1) access to metadata • Make all fields nullable to allow for arbitrary column projections • Avro enables literate programming
    record AlignmentRecord {
      union { null, Contig } contig = null;
      union { null, long } start = null;
      union { null, long } end = null;
      union { null, int } mapq = null;
      union { null, string } readName = null;
      union { null, string } sequence = null;
      union { null, string } mateReference = null;
      union { null, long } mateAlignmentStart = null;
      union { null, string } cigar = null;
      union { null, string } qual = null;
      union { null, string } recordGroupName = null;
      union { int, null } basesTrimmedFromStart = 0;
      union { int, null } basesTrimmedFromEnd = 0;
      union { boolean, null } readPaired = false;
      union { boolean, null } properPair = false;
      union { boolean, null } readMapped = false;
      union { boolean, null } mateMapped = false;
      union { boolean, null } firstOfPair = false;
      union { boolean, null } secondOfPair = false;
      union { boolean, null } failedVendorQualityChecks = false;
      union { boolean, null } duplicateRead = false;
      union { boolean, null } readNegativeStrand = false;
      union { boolean, null } mateNegativeStrand = false;
      union { boolean, null } primaryAlignment = false;
      union { boolean, null } secondaryAlignment = false;
      union { boolean, null } supplementaryAlignment = false;
      union { null, string } mismatchingPositions = null;
      union { null, string } origQual = null;
      union { null, string } attributes = null;
      union { null, string } recordGroupSequencingCenter = null;
      union { null, string } recordGroupDescription = null;
      union { null, long } recordGroupRunDateEpoch = null;
      union { null, string } recordGroupFlowOrder = null;
      union { null, string } recordGroupKeySequence = null;
      union { null, string } recordGroupLibrary = null;
      union { null, int } recordGroupPredictedMedianInsertSize = null;
      union { null, string } recordGroupPlatform = null;
      union { null, string } recordGroupPlatformUnit = null;
      union { null, string } recordGroupSample = null;
      union { null, Contig } mateContig = null;
    }
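The reason nullable fields enable arbitrary projections can be shown with a small model. The sketch below mirrors a handful of fields from the schema as a Python dataclass whose fields all have defaults; `project` is a hypothetical helper, not an ADAM or Avro function.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AlignmentRecord:
    # A small subset of the schema's fields; every field has a default,
    # so a record can be built from any subset of columns.
    readName: Optional[str] = None
    sequence: Optional[str] = None
    start: Optional[int] = None
    mapq: Optional[int] = None
    readMapped: Optional[bool] = False
    duplicateRead: Optional[bool] = False


def project(record, columns):
    """Copy only the requested columns; all other fields keep their defaults."""
    kept = {f.name: getattr(record, f.name)
            for f in fields(record) if f.name in columns}
    return AlignmentRecord(**kept)


if __name__ == "__main__":
    full = AlignmentRecord(readName="read1", sequence="ACGT", start=10,
                           mapq=60, readMapped=True)
    # A flagstat-style computation only needs the boolean flags.
    print(project(full, {"readMapped", "duplicateRead"}))
```

If any field were required, a projection that omitted it could not produce a valid record, which is why the schema makes every field nullable or defaulted.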
  • 28. Parquet • ASF Incubator project, based on Google Dremel • http://www.parquet.io • High performance columnar store with support for projections and push-down predicates • 3 layers of parallelism: • File/row group • Column chunk • Page Image from Parquet format definition: https://github.com/Parquet/parquet-format
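Projections and push-down predicates can be illustrated with a toy in-memory model. The sketch below is not Parquet's API or file layout: it stores each row group column by column with min/max statistics, and the names `make_row_group` and `scan` are invented for this example.

```python
def make_row_group(rows):
    """Store a non-empty list of row dicts column-wise, with min/max statistics."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {name: (min(values), max(values)) for name, values in columns.items()}
    return {"columns": columns, "stats": stats}


def scan(row_groups, select, column, low, high):
    """Return the `select` columns of rows where low <= row[column] < high.

    Also returns how many row groups were skipped using statistics alone.
    """
    results = []
    skipped = 0
    for group in row_groups:
        col_min, col_max = group["stats"][column]
        if col_max < low or col_min >= high:
            skipped += 1  # predicate push-down: the group cannot match
            continue
        keys = group["columns"][column]
        for i, key in enumerate(keys):
            if low <= key < high:
                # Projection: only the selected columns are touched.
                results.append(tuple(group["columns"][name][i] for name in select))
    return results, skipped


if __name__ == "__main__":
    groups = [
        make_row_group([{"start": 100, "mapq": 60, "sequence": "ACGT"},
                        {"start": 300, "mapq": 30, "sequence": "TTGA"}]),
        make_row_group([{"start": 1000, "mapq": 50, "sequence": "GGCA"},
                        {"start": 1100, "mapq": 40, "sequence": "CATG"}]),
    ]
    print(scan(groups, ["start", "mapq"], "start", 1000, 2000))
```

The `sequence` column is never read in this query, and the first row group is discarded without inspecting any of its rows; those two savings are what the slide's "projections and push-down predicates" refer to.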
  • 29. Access to Remote Data • For genomics, we often have a very large dataset of which we only want to analyze a part • This dataset might be stored in S3 or an equivalent block store • Minimize data movement by allowing Parquet to support predicate pushdown/projections into S3 • Work is in progress, found at https://github.com/bigdatagenomics/adam/tree/multi-loader
  • 30. Performance • Reduced pipeline time from 100 hrs to ~1hr • Linear speedup through 128 nodes, when processing 234GB of data • For flagstat, columnar projection leads to a 5x speedup
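As a rough consistency check, the numbers from the pipeline slide (~120 hours end to end, ~100 hours in the stages ADAM targets) can be combined with the reduction quoted here. The sketch assumes the remaining ~20 hours are unchanged, which the slides do not state; the function name is illustrative.

```python
def end_to_end_speedup(total_hours, accelerated_hours, new_hours):
    """Overall speedup when only part of a pipeline gets faster.

    Assumes the rest of the pipeline keeps its original running time.
    """
    untouched_hours = total_hours - accelerated_hours
    return total_hours / (untouched_hours + new_hours)


if __name__ == "__main__":
    # Figures from the slides: ~120 h total, ~100 h of it reduced to ~1 h.
    print("stage speedup:", 100 / 1)
    print("end-to-end speedup:", round(end_to_end_speedup(120, 100, 1), 2))
```

Under that assumption the stages ADAM replaces become roughly 100 times faster, while the whole pipeline is bounded by the hours that were not accelerated.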
  • 31. ADAM Status • Apache 2 licensed OSS • 25 contributors across 10 institutions • Pushing for production 1.0 release towards end of year • Working with GA4GH to use concepts from ADAM to improve broader genomics data management techniques
  • 32. Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Dave Patterson, Anthony Joseph • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher • GenomeBridge: Timothy Danford, Carl Yeksigian • The Broad Institute: Chris Hartl • Cloudera: Uri Laserson • Microsoft Research: Jeremy Elson, Ravi Pandya • And other open source contributors, including Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordoir!