SlideShare a Scribd company logo
Scaling up genomic 
analysis with ADAM 
Frank Austin Nothaft, UC Berkeley AMPLab 
fnothaft@berkeley.edu, @fnothaft 
12/8/2014
Data Intensive Genomics 
• Scale of genomic analyses is growing rapidly: 
• New experiments sequence 10-100k samples 
• Use high coverage, WGS for variant analyses 
• 100k samples @ 60x WGS will generate ~20PB of 
read data and ~300TB of genotype data
Petabytes Cause Problems 
1. Analysis systems must be horizontally scalable 
without substantial programmer overhead 
2. Data storage format must compress well while 
providing good read performance 
3. Need to efficiently slice and dice dataset: not all 
users want the same views or subsets of data
Analysis Characteristics 
• Current genomics pipelines are limited by I/O 
• Most genomics algorithms can be formulated as a 
data or graph parallel computation 
• Analysis algorithms use iteration and pipelining 
• Reference genome/experiment metadata access 
must be cheap! —> impacts analysis performance
What is ADAM? 
• An open source, high performance, distributed 
platform for genomic analysis 
• ADAM defines a: 
1. Data schema and layout on disk* 
2. A Scala API 
3. A command line interface 
* Via Avro and Parquet
Principles for Scalable 
Design in ADAM 
• Reuse commodity horizontally scalable systems 
• Parallel FS and data representation (HDFS + 
Parquet) combined with in-memory computing 
eliminates disk bandwidth bottleneck 
• Spark provides horizontally scalable iterative/ 
pipelined Map-Reduce 
• Minimize data movement: send code to data, 
efficiently encode metadata
• An in-memory data parallel computing framework 
• Optimized for iterative jobs —> unlike Hadoop 
• Data maintained in memory unless inter-node 
movement needed (e.g., on repartitioning) 
• Presents a functional programing API, along with support 
for iterative programming via REPL 
• Set Daytona Greysort record (100TB in 23 min, 206 nodes)
Data Format 
• Avro schema encoded by Parquet 
• Schema can be updated without 
breaking backwards compatibility 
• Normalize metadata fields into 
schema for O(1) metadata access 
• Genotype schema is strictly 
biallelic, a “cell in the matrix” 
record AlignmentRecord { 
union { null, Contig } contig = null; 
union { null, long } start = null; 
union { null, long } end = null; 
union { null, int } mapq = null; 
union { null, string } readName = null; 
union { null, string } sequence = null; 
union { null, string } mateReference = null; 
union { null, long } mateAlignmentStart = null; 
union { null, string } cigar = null; 
union { null, string } qual = null; 
union { null, string } recordGroupName = null; 
union { int, null } basesTrimmedFromStart = 0; 
union { int, null } basesTrimmedFromEnd = 0; 
union { boolean, null } readPaired = false; 
union { boolean, null } properPair = false; 
union { boolean, null } readMapped = false; 
union { boolean, null } mateMapped = false; 
union { boolean, null } firstOfPair = false; 
union { boolean, null } secondOfPair = false; 
union { boolean, null } failedVendorQualityChecks = false; 
union { boolean, null } duplicateRead = false; 
union { boolean, null } readNegativeStrand = false; 
union { boolean, null } mateNegativeStrand = false; 
union { boolean, null } primaryAlignment = false; 
union { boolean, null } secondaryAlignment = false; 
union { boolean, null } supplementaryAlignment = false; 
union { null, string } mismatchingPositions = null; 
union { null, string } origQual = null; 
union { null, string } attributes = null; 
union { null, string } recordGroupSequencingCenter = null; 
union { null, string } recordGroupDescription = null; 
union { null, long } recordGroupRunDateEpoch = null; 
union { null, string } recordGroupFlowOrder = null; 
union { null, string } recordGroupKeySequence = null; 
union { null, string } recordGroupLibrary = null; 
union { null, int } recordGroupPredictedMedianInsertSize = null; 
union { null, string } recordGroupPlatform = null; 
union { null, string } recordGroupPlatformUnit = null; 
union { null, string } recordGroupSample = null; 
union { null, Contig} mateContig = null; 
}
Parquet 
• ASF Incubator project, based on 
Google Dremel 
• http://www.parquet.io 
• High performance columnar 
store with support for projections 
and push-down predicates 
• 3 layers of parallelism: 
• File/row group 
• Column chunk 
• Page 
Image from Parquet format definition: https://github.com/Parquet/parquet-format
Big Data in Parquet 
• ADAM in Parquet provides a 25% improvement over 
compressed BAM 
• Enables efficient slice-and-dice: 
• Can select column projections —> reduce I/O 
• Support pushdown predicates for efficient filtering 
• Have Parquet/S3 integration to push computing 
down into remote block stores for cold data
Scalability 
• Evaluated on 1000G WGS 
NA12878, 234GB dataset 
• Used 32-128 m2.4xlarge, 1 
cr1.8xlarge from AWS 
• Achieve linear scalability out 
to 128 nodes for most tasks 
• 2-4x improvement vs {GATK, 
samtools/Picard} on single 
machine for most tasks
Long-read assembly 
with PacMin
The State of Analysis 
• Conventional short-read alignment based pipelines 
are really good at calling SNPs 
• Need improvement at calling INDELs and SVs 
• And are slow: 2 weeks to sequence, 1 week to 
analyze. Not fast enough. 
• If we move away from short reads, do we have other 
options?
Opportunities 
• New read technologies are available 
• Provide much longer reads (250bp vs. >10kbp) 
• Different error model… (15% INDEL errors, vs. 2% 
SNP errors) 
• Generally, lower sequence specific bias 
Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
If long reads are available… 
• We can use conventional methods: 
Carneiro et al, Genome Biology 2012
But! 
• Why not make raw assemblies out of the reads? 
Find overlapping reads Find consensus sequence 
for all pairs of reads (i,j): 
i j 
=? 
…ACACTGCGACTCATCGACTC… 
• Problems: 
1. Overlapping is O(n 
2 
) and single evaluation is expensive anyways 
2. Typical algorithms find a single consensus sequence; what if we’ve got 
polymorphisms?
Fast Overlapping with 
MinHashing 
• Wonderful realization by Berlin et al1: overlapping is 
similar to document similarity problem 
• Use MinHashing to approximate similarity: 
1: Berlin et al, bioRxiv 2014 
Per document/read, 
compute signature:! 
! 
1. Cut into shingles 
2. Apply random 
hashes to shingles 
3. Take min over all 
random hashes 
Hash into buckets:! 
! 
Signatures of length l 
can be hashed into b 
buckets, so we expect 
to compare all elements 
with similarity 
≥ (1/b)^(b/l) 
Compare:! 
! 
For two documents with 
signatures of length l, 
Jaccard similarity is 
estimated by 
(# equal hashes) / l 
! 
• Easy to implement in Spark: map, groupBy, map, filter
Overlaps to Assemblies 
• Finding pairwise overlaps gives us a directed 
graph between reads (lots of edges!)
Transitive Reduction 
• We can find a consensus between clique members 
• Or, we can reduce down: 
• Via two iterations of Pregel!
Monoallelic Sequence Model 
• Traditional probabilistic models assume independence 
at each site and a good reference model 
• This discards information about local sequence context 
• Can consider a different formulation of the problem: 
• Per reduced segment, build a graph of the alleles 
• Find the allelic copy numbers that maximize 
segment probability
Allele Graphs 
ACACTCG 
C 
A 
TCTCA 
G 
C 
• Edges of graph define conditional probabilities 
! 
! 
TCCACACT 
• Can efficiently marginalize probabilities over graph using Eliminate 
algorithm1, exactly solve for argmax 
1. Jordan, “Probabilistic Graphical Models.” 
Notes:! 
X = copy number of this allele 
Y = copy number of preceding allele 
k = number of reads observed 
j = number of reads supporting Y —> X transition 
Pi = probability that read i supports Y —> X transition
Output 
• Current assemblers emit FASTA contigs 
• We’ll emit “multigs”, which we’ll map back to reference 
graph 
• Multig = multi-allelic (polymorphic) contig 
• Will include a confidence score per base 
• Working with UCSC, who’ve done some really neat work1 
deriving formalisms & building software for mapping 
between sequence graphs, and GA4GH ref. variation team 
1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
Acknowledgements 
• UC Berkeley: Matt Massie, André Schumacher, 
Jey Kottalam, Christos Kozanitis, Adam Bloniarz! 
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael 
Linderman, Jeff Hammerbacher! 
• GenomeBridge: Timothy Danford, Carl Yeksigian! 
• Cloudera: Uri Laserson! 
• Microsoft Research: Jeremy Elson, Ravi Pandya! 
• And many other open source contributors: 26 
contributors to ADAM/BDG from >11 institutions

More Related Content

What's hot

Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
fnothaft
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
fnothaft
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
Andy Petrella
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Andy Petrella
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014
StampedeCon
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
Dataconomy Media
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
Timothy Danford
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteMatt Massie
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
Uri Laserson
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
Uri Laserson
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
Allen Day, PhD
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
Anqi Fu
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shangSAIL_QU
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Kim Herzig
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Spark Summit
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
Sri Ambati
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
Databricks
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
Jean-Paul Calbimonte
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
Jean-Paul Calbimonte
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
University of California, San Diego
 

What's hot (20)

Scalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAMScalable Genome Analysis With ADAM
Scalable Genome Analysis With ADAM
 
Scalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAMScalable Genome Analysis with ADAM
Scalable Genome Analysis with ADAM
 
Spark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scaleSpark Summit Europe: Share and analyse genomic data at scale
Spark Summit Europe: Share and analyse genomic data at scale
 
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
Data Enthusiasts London: Scalable and Interoperable data services. Applied to...
 
Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014Managing Genomes At Scale: What We Learned - StampedeCon 2014
Managing Genomes At Scale: What We Learned - StampedeCon 2014
 
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"..."Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
"Spark, Deep Learning and Life Sciences, Systems Biology in the Big Data Age"...
 
Spark Summit East 2015
Spark Summit East 2015Spark Summit East 2015
Spark Summit East 2015
 
Ga4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger instituteGa4 gh meeting at the the sanger institute
Ga4 gh meeting at the the sanger institute
 
Genomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive BiologyGenomics Is Not Special: Towards Data Intensive Biology
Genomics Is Not Special: Towards Data Intensive Biology
 
Hadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant StoreHadoop for Bioinformatics: Building a Scalable Variant Store
Hadoop for Bioinformatics: Building a Scalable Variant Store
 
Genome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAMGenome Analysis Pipelines with Spark and ADAM
Genome Analysis Pipelines with Spark and ADAM
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 
Ase2010 shang
Ase2010 shangAse2010 shang
Ase2010 shang
 
Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)Mining and Untangling Change Genealogies (PhD Defense Talk)
Mining and Untangling Change Genealogies (PhD Defense Talk)
 
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier TordoirShare and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
Share and analyze geonomic data at scale by Andy Petrella and Xavier Tordoir
 
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...H2O World -  Sparkling water on the Spark Notebook: Interactive Genomes Clust...
H2O World - Sparkling water on the Spark Notebook: Interactive Genomes Clust...
 
Enabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQLEnabling Biobank-Scale Genomic Processing with Spark SQL
Enabling Biobank-Scale Genomic Processing with Spark SQL
 
RDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of SemanticsRDF Stream Processing and the role of Semantics
RDF Stream Processing and the role of Semantics
 
RDF Stream Processing: Let's React
RDF Stream Processing: Let's ReactRDF Stream Processing: Let's React
RDF Stream Processing: Let's React
 
The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...The Materials Project - Combining Science and Informatics to Accelerate Mater...
The Materials Project - Combining Science and Informatics to Accelerate Mater...
 

Viewers also liked

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
Matt Massie
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Allen Day, PhD
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1
MapR Technologies
 
Developing openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesDeveloping openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesPablo Pazos
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
Allen Day, PhD
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
Amazon Web Services
 

Viewers also liked (6)

Strata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAMStrata Big Data Science Talk on ADAM
Strata Big Data Science Talk on ADAM
 
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGIHadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
Hadoop and Genomics - What you need to know - 2015.04.09 - Shenzhen - BGI
 
Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1Free Code Friday: Genome Resequencing with Spark, Part 1
Free Code Friday: Genome Resequencing with Spark, Part 1
 
Developing openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalitiesDeveloping openEHR EHRs - core functionalities
Developing openEHR EHRs - core functionalities
 
Genomics isn't Special
Genomics isn't SpecialGenomics isn't Special
Genomics isn't Special
 
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 20162016 AWS Life Sciences Days | Boston, MA – May 17, 2016
2016 AWS Life Sciences Days | Boston, MA – May 17, 2016
 

Similar to Scaling up genomic analysis with ADAM

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Spark Summit
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
huguk
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
Anubhav Jain
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
Marco Quartulli
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
fnothaft
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
Anubhav Jain
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems
fnothaft
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
jins0618
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
Anubhav Jain
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
SnappyData
 
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulationsSFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
South Tyrol Free Software Conference
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
Anubhav Jain
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Herman Wu
 
User biglm
User biglmUser biglm
User biglm
johnatan pladott
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
Zhong Wang
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Spark Summit
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Databricks
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
Prof. Wim Van Criekinge
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
Data Con LA
 

Similar to Scaling up genomic analysis with ADAM (20)

Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
Processing Terabyte-Scale Genomics Datasets with ADAM: Spark Summit East talk...
 
GraphChi big graph processing
GraphChi big graph processingGraphChi big graph processing
GraphChi big graph processing
 
Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...Atomate: a high-level interface to generate, execute, and analyze computation...
Atomate: a high-level interface to generate, execute, and analyze computation...
 
04 open source_tools
04 open source_tools04 open source_tools
04 open source_tools
 
Scaling Genomic Analyses
Scaling Genomic AnalysesScaling Genomic Analyses
Scaling Genomic Analyses
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems Rethinking Data-Intensive Science Using Scalable Analytics Systems
Rethinking Data-Intensive Science Using Scalable Analytics Systems
 
Ling liu part 02:big graph processing
Ling liu part 02:big graph processingLing liu part 02:big graph processing
Ling liu part 02:big graph processing
 
Automating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomateAutomating materials science workflows with pymatgen, FireWorks, and atomate
Automating materials science workflows with pymatgen, FireWorks, and atomate
 
Intro to SnappyData Webinar
Intro to SnappyData WebinarIntro to SnappyData Webinar
Intro to SnappyData Webinar
 
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulationsSFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
SFSCON23 - Yaman Güçlü - Psydac a Python IGA library for large-scale simulations
 
Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...Software tools, crystal descriptors, and machine learning applied to material...
Software tools, crystal descriptors, and machine learning applied to material...
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
 
User biglm
User biglmUser biglm
User biglm
 
BioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing dataBioPig for scalable analysis of big sequencing data
BioPig for scalable analysis of big sequencing data
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue SnappyData Overview Slidedeck for Big Data Bellevue
SnappyData Overview Slidedeck for Big Data Bellevue
 
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
Alchemist: An Apache Spark <=> MPI Interface with Michael Mahoney and Kai Rot...
 
Bioinformatica t4-alignments
Bioinformatica t4-alignmentsBioinformatica t4-alignments
Bioinformatica t4-alignments
 
Impala presentation ahad rana
Impala presentation ahad ranaImpala presentation ahad rana
Impala presentation ahad rana
 

More from fnothaft

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
fnothaft
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
fnothaft
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assembly
fnothaft
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environments
fnothaft
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
fnothaft
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114fnothaft
 

More from fnothaft (6)

Scaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAMScaling up genomic analysis with ADAM
Scaling up genomic analysis with ADAM
 
Reproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral ModelsReproducible Emulation of Analog Behavioral Models
Reproducible Emulation of Analog Behavioral Models
 
CS176: Genome Assembly
CS176: Genome AssemblyCS176: Genome Assembly
CS176: Genome Assembly
 
Execution Environments
Execution EnvironmentsExecution Environments
Execution Environments
 
PacMin @ AMPLab All-Hands
PacMin @ AMPLab All-HandsPacMin @ AMPLab All-Hands
PacMin @ AMPLab All-Hands
 
Adam bosc-071114
Adam bosc-071114Adam bosc-071114
Adam bosc-071114
 

Recently uploaded

Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
AhmedHussein950959
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
Intella Parts
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Sreedhar Chowdam
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
JoytuBarua2
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
Kamal Acharya
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
PrashantGoswami42
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
R&R Consult
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
Kamal Acharya
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation & Control
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
gdsczhcet
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
Divya Somashekar
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
AafreenAbuthahir2
 

Recently uploaded (20)

Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
ASME IX(9) 2007 Full Version .pdf
ASME IX(9)  2007 Full Version       .pdfASME IX(9)  2007 Full Version       .pdf
ASME IX(9) 2007 Full Version .pdf
 
Forklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella PartsForklift Classes Overview by Intella Parts
Forklift Classes Overview by Intella Parts
 
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&BDesign and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
Planning Of Procurement o different goods and services
Planning Of Procurement o different goods and servicesPlanning Of Procurement o different goods and services
Planning Of Procurement o different goods and services
 
Courier management system project report.pdf
Courier management system project report.pdfCourier management system project report.pdf
Courier management system project report.pdf
 
Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.Quality defects in TMT Bars, Possible causes and Potential Solutions.
Quality defects in TMT Bars, Possible causes and Potential Solutions.
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Cosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdfCosmetic shop management system project report.pdf
Cosmetic shop management system project report.pdf
 
Water Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdfWater Industry Process Automation and Control Monthly - May 2024.pdf
Water Industry Process Automation and Control Monthly - May 2024.pdf
 
Gen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdfGen AI Study Jams _ For the GDSC Leads in India.pdf
Gen AI Study Jams _ For the GDSC Leads in India.pdf
 
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
H.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdfH.Seo,  ICLR 2024, MLILAB,  KAIST AI.pdf
H.Seo, ICLR 2024, MLILAB, KAIST AI.pdf
 
block diagram and signal flow graph representation
block diagram and signal flow graph representationblock diagram and signal flow graph representation
block diagram and signal flow graph representation
 
WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234WATER CRISIS and its solutions-pptx 1234
WATER CRISIS and its solutions-pptx 1234
 

Scaling up genomic analysis with ADAM

  • 1. Scaling up genomic analysis with ADAM Frank Austin Nothaft, UC Berkeley AMPLab fnothaft@berkeley.edu, @fnothaft 12/8/2014
  • 2. Data Intensive Genomics • Scale of genomic analyses is growing rapidly: • New experiments sequence 10-100k samples • Use high coverage, WGS for variant analyses • 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
  • 3. Petabytes Cause Problems 1. Analysis systems must be horizontally scalable without substantial programmer overhead 2. Data storage format must compress well while providing good read performance 3. Need to efficiently slice and dice dataset: not all users want the same views or subsets of data
  • 4. Analysis Characteristics • Current genomics pipelines are limited by I/O • Most genomics algorithms can be formulated as a data or graph parallel computation • Analysis algorithms use iteration and pipelining • Reference genome/experiment metadata access must be cheap! —> impacts analysis performance
  • 5. What is ADAM? • An open source, high performance, distributed platform for genomic analysis • ADAM defines a: 1. Data schema and layout on disk* 2. A Scala API 3. A command line interface * Via Avro and Parquet
  • 6. Principles for Scalable Design in ADAM • Reuse commodity horizontally scalable systems • Parallel FS and data representation (HDFS + Parquet) combined with in-memory computing eliminates disk bandwidth bottleneck • Spark provides horizontally scalable iterative/ pipelined Map-Reduce • Minimize data movement: send code to data, efficiently encode metadata
  • 7. • An in-memory data parallel computing framework • Optimized for iterative jobs —> unlike Hadoop • Data maintained in memory unless inter-node movement needed (e.g., on repartitioning) • Presents a functional programing API, along with support for iterative programming via REPL • Set Daytona Greysort record (100TB in 23 min, 206 nodes)
  • 8. Data Format • Avro schema encoded by Parquet • Schema can be updated without breaking backwards compatibility • Normalize metadata fields into schema for O(1) metadata access • Genotype schema is strictly biallelic, a “cell in the matrix” record AlignmentRecord { union { null, Contig } contig = null; union { null, long } start = null; union { null, long } end = null; union { null, int } mapq = null; union { null, string } readName = null; union { null, string } sequence = null; union { null, string } mateReference = null; union { null, long } mateAlignmentStart = null; union { null, string } cigar = null; union { null, string } qual = null; union { null, string } recordGroupName = null; union { int, null } basesTrimmedFromStart = 0; union { int, null } basesTrimmedFromEnd = 0; union { boolean, null } readPaired = false; union { boolean, null } properPair = false; union { boolean, null } readMapped = false; union { boolean, null } mateMapped = false; union { boolean, null } firstOfPair = false; union { boolean, null } secondOfPair = false; union { boolean, null } failedVendorQualityChecks = false; union { boolean, null } duplicateRead = false; union { boolean, null } readNegativeStrand = false; union { boolean, null } mateNegativeStrand = false; union { boolean, null } primaryAlignment = false; union { boolean, null } secondaryAlignment = false; union { boolean, null } supplementaryAlignment = false; union { null, string } mismatchingPositions = null; union { null, string } origQual = null; union { null, string } attributes = null; union { null, string } recordGroupSequencingCenter = null; union { null, string } recordGroupDescription = null; union { null, long } recordGroupRunDateEpoch = null; union { null, string } recordGroupFlowOrder = null; union { null, string } recordGroupKeySequence = null; union { null, string } recordGroupLibrary = null; union { null, int } recordGroupPredictedMedianInsertSize = null; union { null, string } recordGroupPlatform = null; union { null, string } recordGroupPlatformUnit = null; union { null, string } recordGroupSample = null; union { null, Contig} mateContig = null; }
  • 9. Parquet • ASF Incubator project, based on Google Dremel • http://www.parquet.io • High performance columnar store with support for projections and push-down predicates • 3 layers of parallelism: • File/row group • Column chunk • Page Image from Parquet format definition: https://github.com/Parquet/parquet-format
  • 10. Big Data in Parquet • ADAM in Parquet provides a 25% improvement over compressed BAM • Enables efficient slice-and-dice: • Can select column projections —> reduce I/O • Support pushdown predicates for efficient filtering • Have Parquet/S3 integration to push computing down into remote block stores for cold data
  • 11. Scalability • Evaluated on 1000G WGS NA12878, 234GB dataset • Used 32-128 m2.4xlarge, 1 cr1.8xlarge from AWS • Achieve linear scalability out to 128 nodes for most tasks • 2-4x improvement vs {GATK, samtools/Picard} on single machine for most tasks
  • 13. The State of Analysis • Conventional short-read alignment based pipelines are really good at calling SNPs • Need improvement at calling INDELs and SVs • And are slow: 2 weeks to sequence, 1 week to analyze. Not fast enough. • If we move away from short reads, do we have other options?
  • 14. Opportunities • New read technologies are available • Provide much longer reads (250bp vs. >10kbp) • Different error model… (15% INDEL errors, vs. 2% SNP errors) • Generally, lower sequence specific bias Left: PacBio homepage, Right: Wired, http://www.wired.com/2012/03/oxford-nanopore-sequencing-usb/
  • 15. If long reads are available… • We can use conventional methods: Carneiro et al, Genome Biology 2012
  • 16. But! • Why not make raw assemblies out of the reads? Find overlapping reads Find consensus sequence for all pairs of reads (i,j): i j =? …ACACTGCGACTCATCGACTC… • Problems: 1. Overlapping is O(n 2 ) and single evaluation is expensive anyways 2. Typical algorithms find a single consensus sequence; what if we’ve got polymorphisms?
  • 17. Fast Overlapping with MinHashing • Wonderful realization by Berlin et al1: overlapping is similar to document similarity problem • Use MinHashing to approximate similarity: 1: Berlin et al, bioRxiv 2014 Per document/read, compute signature:! ! 1. Cut into shingles 2. Apply random hashes to shingles 3. Take min over all random hashes Hash into buckets:! ! Signatures of length l can be hashed into b buckets, so we expect to compare all elements with similarity ≥ (1/b)^(b/l) Compare:! ! For two documents with signatures of length l, Jaccard similarity is estimated by (# equal hashes) / l ! • Easy to implement in Spark: map, groupBy, map, filter
  • 18. Overlaps to Assemblies • Finding pairwise overlaps gives us a directed graph between reads (lots of edges!)
  • 19. Transitive Reduction • We can find a consensus between clique members • Or, we can reduce down: • Via two iterations of Pregel!
  • 20. Monoallelic Sequence Model • Traditional probabilistic models assume independence at each site and a good reference model • This discards information about local sequence context • Can consider a different formulation of the problem: • Per reduced segment, build a graph of the alleles • Find the allelic copy numbers that maximize segment probability
  • 21. Allele Graphs ACACTCG C A TCTCA G C • Edges of graph define conditional probabilities ! ! TCCACACT • Can efficiently marginalize probabilities over graph using Eliminate algorithm1, exactly solve for argmax 1. Jordan, “Probabilistic Graphical Models.” Notes:! X = copy number of this allele Y = copy number of preceding allele k = number of reads observed j = number of reads supporting Y —> X transition Pi = probability that read i supports Y —> X transition
  • 22. Output • Current assemblers emit FASTA contigs • We’ll emit “multigs”, which we’ll map back to reference graph • Multig = multi-allelic (polymorphic) contig • Will include a confidence score per base • Working with UCSC, who’ve done some really neat work1 deriving formalisms & building software for mapping between sequence graphs, and GA4GH ref. variation team 1. Paten et al, “Mapping to a Reference Genome Structure”, arXiv 2014.
  • 23. Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis, Adam Bloniarz! • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher! • GenomeBridge: Timothy Danford, Carl Yeksigian! • Cloudera: Uri Laserson! • Microsoft Research: Jeremy Elson, Ravi Pandya! • And many other open source contributors: 26 contributors to ADAM/BDG from >11 institutions