Scalable Genome Analysis
With ADAM
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
7/23/2015
Analyzing genomes:
What is our goal?
• Genomes are the “source” code for life:
• The human genome is a 3.2B character
“program”, split across 46 “files”
• Within a species, genomes are ~99.9% similar
• The 0.1% variance gives rise to diverse traits, as
well as diseases
The Sequencing Abstraction
It was the best of times, it was the worst of times…
Metaphor borrowed from Michael Schatz
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
• Sequencing is a Poisson substring sampling process
• For $1,000, we can sequence a 30x copy of your genome
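A quick back-of-the-envelope check on the Poisson model (my arithmetic, not from the slides): if each base is covered by a Poisson(c)-distributed number of reads, the probability that a given base is never sampled is P(coverage = 0) = e^{-c}; at c = 30, e^{-30} ≈ 9 × 10^{-14}, which is why ~30x is a common target depth for whole-genome resequencing.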
Genome Resequencing
• The Human Genome Project identified the “average”
genome from 20 individuals at $1B cost
• To make this process cheaper, we use our knowledge
of the “average” genome to calculate a diff
• Two problems:
• How do we compute this diff?
• How do we make sense of the differences?
Alignment and Assembly
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
What do genomic
analysis tools look like?
Genomics Pipelines
Source: The Broad Institute of MIT/Harvard
Flat File Formats
• Scientific data is typically stored in application-specific
file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
Flat Architectures
• APIs present very barebones abstractions:
• GATK: Sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs
A green field approach
First, define a schema
record AlignmentRecord {	
union { null, Contig } contig = null;	
union { null, long } start = null;	
union { null, long } end = null;	
union { null, int } mapq = null;	
union { null, string } readName = null;	
union { null, string } sequence = null;	
union { null, string } mateReference = null;	
union { null, long } mateAlignmentStart = null;	
union { null, string } cigar = null;	
union { null, string } qual = null;	
union { null, string } recordGroupName = null;	
union { int, null } basesTrimmedFromStart = 0;	
union { int, null } basesTrimmedFromEnd = 0;	
union { boolean, null } readPaired = false;	
union { boolean, null } properPair = false;	
union { boolean, null } readMapped = false;	
union { boolean, null } mateMapped = false;	
union { boolean, null } firstOfPair = false;	
union { boolean, null } secondOfPair = false;	
union { boolean, null } failedVendorQualityChecks = false;	
union { boolean, null } duplicateRead = false;	
union { boolean, null } readNegativeStrand = false;	
union { boolean, null } mateNegativeStrand = false;	
union { boolean, null } primaryAlignment = false;	
union { boolean, null } secondaryAlignment = false;	
union { boolean, null } supplementaryAlignment = false;	
union { null, string } mismatchingPositions = null;	
union { null, string } origQual = null;	
union { null, string } attributes = null;	
union { null, string } recordGroupSequencingCenter = null;	
union { null, string } recordGroupDescription = null;	
union { null, long } recordGroupRunDateEpoch = null;	
union { null, string } recordGroupFlowOrder = null;	
union { null, string } recordGroupKeySequence = null;	
union { null, string } recordGroupLibrary = null;	
union { null, int } recordGroupPredictedMedianInsertSize = null;	
union { null, string } recordGroupPlatform = null;	
union { null, string } recordGroupPlatformUnit = null;	
union { null, string } recordGroupSample = null;	
union { null, Contig } mateContig = null;	
}
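As a rough sketch of what this buys you (written against the RDD-based API ADAM exposed around the time of this talk; the exact loader names and return types have shifted across releases, so treat them as illustrative): once reads are expressed in this schema, loading a legacy BAM and working with it is just ordinary Spark.

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._   // adds the ADAM load methods to SparkContext
import org.bdgenomics.formats.avro.AlignmentRecord

object LoadReads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-reads"))

    // Load a legacy BAM/SAM file (or an ADAM Parquet directory) as records
    // that conform to the AlignmentRecord schema above.
    val reads: RDD[AlignmentRecord] = sc.loadAlignments("/data/sample.bam")

    // Once the data is schema-typed, ordinary Spark transformations apply.
    val mappedCount = reads.filter(r => r.getReadMapped).count()
    println(s"mapped reads: $mappedCount")

    sc.stop()
  }
}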
[Stack diagram (repeated on the following slides). The seven layers and what fills each one:
Application: Transformations
Presentation: Enriched Models
Evidence Access: MapReduce/DBMS
Schema: Data Models
Materialized Data: Columnar Storage
Data Distribution: Parallel FS
Physical Storage: Attached Storage]
A schema provides a
narrow waist
Accelerate common
access patterns
• In genomics, we commonly
have to find observations that
overlap in a coordinate plane
• This coordinate plane is
genomics-specific and is known a priori
• We can use our knowledge of
the coordinate plane to
implement a fast overlap join
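A sketch of the idea behind such a coordinate-aware overlap join, written against plain Spark rather than ADAM's own region-join implementation (the Region type and bin size below are simplified stand-ins): because the coordinate plane is known a priori, we can bucket both datasets into fixed-width bins per contig and only test overlap within a shared bin.

import org.apache.spark.rdd.RDD

// A simplified genomic interval: contig name plus 0-based, half-open coordinates.
case class Region(contig: String, start: Long, end: Long) {
  def overlaps(other: Region): Boolean =
    contig == other.contig && start < other.end && other.start < end
}

object OverlapJoin {

  // Assign a region to every fixed-width bin it touches, so any two
  // overlapping regions are guaranteed to share at least one bin.
  private def bins(r: Region, binSize: Long): Seq[(String, Long)] =
    ((r.start / binSize) to ((r.end - 1) / binSize)).map(b => (r.contig, b))

  // Join two keyed datasets on genomic overlap: co-group on (contig, bin),
  // then do exact overlap checks within each bin.
  def overlapJoin[A, B](left: RDD[(Region, A)],
                        right: RDD[(Region, B)],
                        binSize: Long = 10000L): RDD[((Region, A), (Region, B))] = {
    val leftBinned = left.flatMap {
      case (region, value) => bins(region, binSize).map(bin => (bin, (region, value)))
    }
    val rightBinned = right.flatMap {
      case (region, value) => bins(region, binSize).map(bin => (bin, (region, value)))
    }
    leftBinned.join(rightBinned)
      .values
      .filter { case ((lr, _), (rr, _)) => lr.overlaps(rr) }
      .distinct() // a pair spanning several bins would otherwise appear once per shared bin
  }
}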
Pick appropriate storage
• When accessing scientific
datasets, we frequently slice and
dice the dataset:
• Algorithms may touch
subsets of columns
• We don’t always touch the
whole dataset
• This is a good match for
columnar storage
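A minimal sketch of what that projection looks like in Spark SQL (the Parquet path and the nested contig.contigName field follow the schema above, but the exact layout of an ADAM output directory is an assumption here): only the named columns are read off disk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object ProjectColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("project-columns").getOrCreate()

    // Parquet is columnar, so this query only materializes the two fields
    // it names; the other ~40 AlignmentRecord columns are never read.
    val reads = spark.read.parquet("/data/sample.alignments.adam")

    val meanMapqPerContig = reads
      .select("contig.contigName", "mapq")
      .where("mapq is not null")
      .groupBy("contigName")
      .agg(avg("mapq").as("meanMapq"))

    meanMapqPerContig.show()
    spark.stop()
  }
}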
Is introducing a new data
model really a good idea?
Source: XKCD, http://xkcd.com/927/
A subtle point:
Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to
serialize/deserialize the schema into/from the
legacy flat file format!
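One way to picture that serialize/deserialize layer (a sketch only: it uses htsjdk's SAMRecord as the legacy type and maps a handful of fields; ADAM's real converters are far more complete and handle nulls): a pair of functions between the legacy record and the schema, so nothing above the schema ever sees the flat file format.

import htsjdk.samtools.{ SAMFileHeader, SAMRecord }
import org.bdgenomics.formats.avro.AlignmentRecord

// A deliberately tiny converter: map a handful of SAM fields into the schema
// and back, to show where legacy formats plug into the stack.
object LegacyConverter {

  def fromSam(r: SAMRecord): AlignmentRecord =
    AlignmentRecord.newBuilder()
      .setReadName(r.getReadName)
      .setSequence(r.getReadString)
      .setQual(r.getBaseQualityString)
      .setCigar(r.getCigarString)
      .setStart(r.getAlignmentStart.toLong - 1) // SAM is 1-based; the schema is 0-based
      .setMapq(r.getMappingQuality)
      .setReadMapped(!r.getReadUnmappedFlag)
      .build()

  def toSam(r: AlignmentRecord, header: SAMFileHeader): SAMRecord = {
    val s = new SAMRecord(header)
    s.setReadName(r.getReadName.toString)
    s.setReadString(r.getSequence.toString)
    s.setBaseQualityString(r.getQual.toString)
    s.setCigarString(r.getCigar.toString)
    s.setAlignmentStart(r.getStart.toInt + 1)
    s.setMappingQuality(r.getMapq)
    s.setReadUnmappedFlag(!r.getReadMapped)
    s
  }
}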
[Diagram: two stacks sharing the same Schema and Data Models layers; one materializes the data in a legacy file format, the other in columnar storage.]
A subtle point:
Proper stack design can simplify
backwards compatibility
This is a view!
[Same two stacks as above; the legacy file format stack is the view.]
Using the ADAM stack to
analyze genetic variants
What are the challenges?
• The differences between people’s genomes lead
to various traits and diseases, but:
• Variants don’t always have straightforward
explanations
• A 3B-base genome with variation at 0.1% of
locations → lots of variants! How do we find
the important ones?
Making Sense of Variation
• Variation in the genome can affect biology in
several ways:
• A variant can modify or break a protein
• A variant can modify how much of a protein is
created
• The subset of your genome that encodes proteins
is the exome. This is ~1% of your genome!
How do we link
variants to traits?
• Statistical modeling!
• We may use a simple model (e.g., a χ² test)
• Or, we may use a more complex model (e.g., linear
regression)
• This is known as a genotype-to-phenotype
association test
• Most of these tests can be expressed as aggregation
functions, i.e., as a reduction. This maps well to Spark!
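A sketch of what "association test as a reduction" can look like in Spark (the Observation type and the simple allelic χ² below are illustrative, not ADAM's API): each sample's genotype at a variant contributes allele counts to a per-variant 2x2 contingency table via reduceByKey, and the table is then scored.

import org.apache.spark.rdd.RDD

// One observation: a variant site, the number of alternate alleles a sample
// carries there (0, 1, or 2), and whether that sample shows the trait.
case class Observation(variantId: String, altAlleleCount: Int, hasTrait: Boolean)

// Per-variant allele counts split by trait status: a 2x2 contingency table.
case class Counts(caseAlt: Long, caseRef: Long, ctrlAlt: Long, ctrlRef: Long) {

  def merge(other: Counts): Counts =
    Counts(caseAlt + other.caseAlt, caseRef + other.caseRef,
           ctrlAlt + other.ctrlAlt, ctrlRef + other.ctrlRef)

  // Pearson chi-squared statistic for the 2x2 table (no continuity correction).
  def chiSquared: Double = {
    val n = (caseAlt + caseRef + ctrlAlt + ctrlRef).toDouble
    if (n == 0) {
      0.0
    } else {
      val rows = Array((caseAlt + caseRef).toDouble, (ctrlAlt + ctrlRef).toDouble)
      val cols = Array((caseAlt + ctrlAlt).toDouble, (caseRef + ctrlRef).toDouble)
      val obs = Array(Array(caseAlt.toDouble, caseRef.toDouble),
                      Array(ctrlAlt.toDouble, ctrlRef.toDouble))
      (for (i <- 0 to 1; j <- 0 to 1) yield {
        val expected = rows(i) * cols(j) / n
        if (expected == 0.0) 0.0 else math.pow(obs(i)(j) - expected, 2) / expected
      }).sum
    }
  }
}

object AssociationTest {

  // Each observation contributes its allele counts to exactly one per-variant
  // table, so the whole test is a map followed by a reduceByKey.
  def run(observations: RDD[Observation]): RDD[(String, Double)] =
    observations
      .map { o =>
        val alt = o.altAlleleCount.toLong
        val ref = 2L - alt
        val counts =
          if (o.hasTrait) Counts(alt, ref, 0L, 0L) else Counts(0L, 0L, alt, ref)
        (o.variantId, counts)
      }
      .reduceByKey((a, b) => a.merge(b))
      .mapValues(_.chiSquared)
}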
What if two variants are
close to each other?
• This phenomenon is known as linkage disequilibrium: if
two variants are close to each other, they are likely to be
inherited together
Levo and Segal, Nat Rev Gen, 2014.
• We can often link expression changes and trait changes
to blocks of the genome, but not to specific variants
Can we break up
these blocks?
• When a variant modifies a gene, we can predict how the
gene is modified:
• Does this variant change how the protein is spliced?
• Does this variant change an amino acid?
• We can discard variants that do not change proteins
• However, if the variant is outside of a gene, how do we
make sense of it?
• Are these variants important?
Mutations in AML
There is a big “long tail”, including people who have
cancer but no “modified” genes!
Ravi Pandya, BeatAML.
Looking outside
of the Exome
• We analyze mutations
in the exome using the
grammar for protein
creation
• Can we apply a similar
approach outside of
the exome?
• Let’s use the grammar
for regulation instead!
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
Grammar Enhanced
Statistical Models
Levo and Segal, Nat Rev Gen, 2014.
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
You Can Help!
• All of our projects are open source:
• https://www.github.com/bigdatagenomics
• Apache 2 licensed
• For a tutorial, see Chapter 10 of Advanced Analytics with
Spark
• More details on ADAM in our latest paper
• The GA4GH is looking for coders to help enable genomic
data sharing: https://github.com/ga4gh/server
Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony
Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman,
Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson, Tom White
• Microsoft Research: Ravi Pandya, Bill Bolosky
• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau
Norgeot
• And many other open source contributors, especially Michael Heuer, Neil
Ferguson, Andy Petrella, Xavier Tordoir
• Total of 40 contributors to ADAM/BDG from >12 institutions
