Scalable Genome Analysis
With ADAM
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
7/23/2015
Analyzing genomes:
What is our goal?
• Genomes are the “source” code for life:
• The human genome is a 3.2B character
“program”, split across 46 “files”
• Within a species, genomes are ~99.9% similar
• The 0.1% variance gives rise to diverse traits, as
well as diseases
The Sequencing Abstraction
It was the best of times, it was the worst of times…
Metaphor borrowed from Michael Schatz
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
• Sequencing is a Poisson substring sampling process
• For $1,000, we can sequence a 30x copy of your genome
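A quick back-of-the-envelope check on the Poisson model (my arithmetic, not from the slides): if each base is covered by a Poisson(c)-distributed number of reads, the probability that a given base is never sampled is P(coverage = 0) = e^{-c}; at c = 30, e^{-30} ≈ 9 × 10^{-14}, which is why ~30x is a common target depth for whole-genome resequencing.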
Genome Resequencing
• The Human Genome Project identified the “average”
genome from 20 individuals at $1B cost
• To make this process cheaper, we use our knowledge
of the “average” genome to calculate a diff
• Two problems:
• How do we compute this diff?
• How do we make sense of the differences?
Alignment and Assembly
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
What do genomic
analysis tools look like?
Genomics Pipelines
Source: The Broad Institute of MIT/Harvard
Flat File Formats
• Scientific data is typically stored in application-specific
file formats:
• Genomic reads: SAM/BAM, CRAM
• Genomic variants: VCF/BCF, MAF
• Genomic features: BED, NarrowPeak, GTF
Flat Architectures
• APIs present very barebones abstractions:
• GATK: Sorted iterator over the genome
• Why are flat architectures bad?
1. Trivial: low level abstractions are not productive
2. Trivial: flat architectures create technical lock-in
3. Subtle: low level abstractions can introduce bugs
A green field approach
First, define a schema
record AlignmentRecord {	
union { null, Contig } contig = null;	
union { null, long } start = null;	
union { null, long } end = null;	
union { null, int } mapq = null;	
union { null, string } readName = null;	
union { null, string } sequence = null;	
union { null, string } mateReference = null;	
union { null, long } mateAlignmentStart = null;	
union { null, string } cigar = null;	
union { null, string } qual = null;	
union { null, string } recordGroupName = null;	
union { int, null } basesTrimmedFromStart = 0;	
union { int, null } basesTrimmedFromEnd = 0;	
union { boolean, null } readPaired = false;	
union { boolean, null } properPair = false;	
union { boolean, null } readMapped = false;	
union { boolean, null } mateMapped = false;	
union { boolean, null } firstOfPair = false;	
union { boolean, null } secondOfPair = false;	
union { boolean, null } failedVendorQualityChecks = false;	
union { boolean, null } duplicateRead = false;	
union { boolean, null } readNegativeStrand = false;	
union { boolean, null } mateNegativeStrand = false;	
union { boolean, null } primaryAlignment = false;	
union { boolean, null } secondaryAlignment = false;	
union { boolean, null } supplementaryAlignment = false;	
union { null, string } mismatchingPositions = null;	
union { null, string } origQual = null;	
union { null, string } attributes = null;	
union { null, string } recordGroupSequencingCenter = null;	
union { null, string } recordGroupDescription = null;	
union { null, long } recordGroupRunDateEpoch = null;	
union { null, string } recordGroupFlowOrder = null;	
union { null, string } recordGroupKeySequence = null;	
union { null, string } recordGroupLibrary = null;	
union { null, int } recordGroupPredictedMedianInsertSize = null;	
union { null, string } recordGroupPlatform = null;	
union { null, string } recordGroupPlatformUnit = null;	
union { null, string } recordGroupSample = null;	
union { null, Contig } mateContig = null;	
}
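As a rough sketch of what this buys you (written against the RDD-based API ADAM exposed around the time of this talk; the exact loader names and return types have shifted across releases, so treat them as illustrative): once reads are expressed in this schema, loading a legacy BAM and working with it is just ordinary Spark.

import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.rdd.RDD
import org.bdgenomics.adam.rdd.ADAMContext._   // adds the ADAM load methods to SparkContext
import org.bdgenomics.formats.avro.AlignmentRecord

object LoadReads {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-reads"))

    // Load a legacy BAM/SAM file (or an ADAM Parquet directory) as records
    // that conform to the AlignmentRecord schema above.
    val reads: RDD[AlignmentRecord] = sc.loadAlignments("/data/sample.bam")

    // Once the data is schema-typed, ordinary Spark transformations apply.
    val mappedCount = reads.filter(r => r.getReadMapped).count()
    println(s"mapped reads: $mappedCount")

    sc.stop()
  }
}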
[Stack diagram (repeated on the following slides). The seven layers and what fills each one:
Application: Transformations
Presentation: Enriched Models
Evidence Access: MapReduce/DBMS
Schema: Data Models
Materialized Data: Columnar Storage
Data Distribution: Parallel FS
Physical Storage: Attached Storage]
A schema provides a
narrow waist
Accelerate common
access patterns
• In genomics, we commonly
have to find observations that
overlap in a coordinate plane
• This coordinate plane is
genomics-specific and is known a priori
• We can use our knowledge of
the coordinate plane to
implement a fast overlap join
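A sketch of the idea behind such a coordinate-aware overlap join, written against plain Spark rather than ADAM's own region-join implementation (the Region type and bin size below are simplified stand-ins): because the coordinate plane is known a priori, we can bucket both datasets into fixed-width bins per contig and only test overlap within a shared bin.

import org.apache.spark.rdd.RDD

// A simplified genomic interval: contig name plus 0-based, half-open coordinates.
case class Region(contig: String, start: Long, end: Long) {
  def overlaps(other: Region): Boolean =
    contig == other.contig && start < other.end && other.start < end
}

object OverlapJoin {

  // Assign a region to every fixed-width bin it touches, so any two
  // overlapping regions are guaranteed to share at least one bin.
  private def bins(r: Region, binSize: Long): Seq[(String, Long)] =
    ((r.start / binSize) to ((r.end - 1) / binSize)).map(b => (r.contig, b))

  // Join two keyed datasets on genomic overlap: co-group on (contig, bin),
  // then do exact overlap checks within each bin.
  def overlapJoin[A, B](left: RDD[(Region, A)],
                        right: RDD[(Region, B)],
                        binSize: Long = 10000L): RDD[((Region, A), (Region, B))] = {
    val leftBinned = left.flatMap {
      case (region, value) => bins(region, binSize).map(bin => (bin, (region, value)))
    }
    val rightBinned = right.flatMap {
      case (region, value) => bins(region, binSize).map(bin => (bin, (region, value)))
    }
    leftBinned.join(rightBinned)
      .values
      .filter { case ((lr, _), (rr, _)) => lr.overlaps(rr) }
      .distinct() // a pair spanning several bins would otherwise appear once per shared bin
  }
}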
Pick appropriate storage
• When accessing scientific
datasets, we frequently slice and
dice the dataset:
• Algorithms may touch
subsets of columns
• We don’t always touch the
whole dataset
• This is a good match for
columnar storage
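A minimal sketch of what that projection looks like in Spark SQL (the Parquet path and the nested contig.contigName field follow the schema above, but the exact layout of an ADAM output directory is an assumption here): only the named columns are read off disk.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object ProjectColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("project-columns").getOrCreate()

    // Parquet is columnar, so this query only materializes the two fields
    // it names; the other ~40 AlignmentRecord columns are never read.
    val reads = spark.read.parquet("/data/sample.alignments.adam")

    val meanMapqPerContig = reads
      .select("contig.contigName", "mapq")
      .where("mapq is not null")
      .groupBy("contigName")
      .agg(avg("mapq").as("meanMapq"))

    meanMapqPerContig.show()
    spark.stop()
  }
}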
Is introducing a new data
model really a good idea?
Source: XKCD, http://xkcd.com/927/
A subtle point:
Proper stack design can simplify
backwards compatibility
To support legacy data formats, you define a way to
serialize/deserialize the schema into/from the
legacy flat file format!
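One way to picture that serialize/deserialize layer (a sketch only: it uses htsjdk's SAMRecord as the legacy type and maps a handful of fields; ADAM's real converters are far more complete and handle nulls): a pair of functions between the legacy record and the schema, so nothing above the schema ever sees the flat file format.

import htsjdk.samtools.{ SAMFileHeader, SAMRecord }
import org.bdgenomics.formats.avro.AlignmentRecord

// A deliberately tiny converter: map a handful of SAM fields into the schema
// and back, to show where legacy formats plug into the stack.
object LegacyConverter {

  def fromSam(r: SAMRecord): AlignmentRecord =
    AlignmentRecord.newBuilder()
      .setReadName(r.getReadName)
      .setSequence(r.getReadString)
      .setQual(r.getBaseQualityString)
      .setCigar(r.getCigarString)
      .setStart(r.getAlignmentStart.toLong - 1) // SAM is 1-based; the schema is 0-based
      .setMapq(r.getMappingQuality)
      .setReadMapped(!r.getReadUnmappedFlag)
      .build()

  def toSam(r: AlignmentRecord, header: SAMFileHeader): SAMRecord = {
    val s = new SAMRecord(header)
    s.setReadName(r.getReadName.toString)
    s.setReadString(r.getSequence.toString)
    s.setBaseQualityString(r.getQual.toString)
    s.setCigarString(r.getCigar.toString)
    s.setAlignmentStart(r.getStart.toInt + 1)
    s.setMappingQuality(r.getMapq)
    s.setReadUnmappedFlag(!r.getReadMapped)
    s
  }
}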
[Diagram: two stacks sharing the same Schema and Data Models layers; one materializes the data in a legacy file format, the other in columnar storage.]
A subtle point:
Proper stack design can simplify
backwards compatibility
This is a view!
[Same two stacks as above; the legacy file format stack is the view.]
Using the ADAM stack to
analyze genetic variants
What are the challenges?
• The differences between people’s genomes lead
to various traits and diseases, but:
• Variants don’t always have straightforward
explanations
• A 3B-base genome with variation at 0.1% of
locations → lots of variants! How do we find
the important ones?
Making Sense of Variation
• Variation in the genome can affect biology in
several ways:
• A variant can modify or break a protein
• A variant can modify how much of a protein is
created
• The subset of your genome that encodes proteins
is the exome. This is ~1% of your genome!
How do we link
variants to traits?
• Statistical modeling!
• We may use a simple model (e.g., a χ² test)
• Or, we may use a more complex model (e.g., linear
regression)
• This is known as a genotype-to-phenotype
association test
• Most of these tests can be expressed as aggregation
functions, i.e., as a reduction. This maps well to Spark!
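A sketch of what "association test as a reduction" can look like in Spark (the Observation type and the simple allelic χ² below are illustrative, not ADAM's API): each sample's genotype at a variant contributes allele counts to a per-variant 2x2 contingency table via reduceByKey, and the table is then scored.

import org.apache.spark.rdd.RDD

// One observation: a variant site, the number of alternate alleles a sample
// carries there (0, 1, or 2), and whether that sample shows the trait.
case class Observation(variantId: String, altAlleleCount: Int, hasTrait: Boolean)

// Per-variant allele counts split by trait status: a 2x2 contingency table.
case class Counts(caseAlt: Long, caseRef: Long, ctrlAlt: Long, ctrlRef: Long) {

  def merge(other: Counts): Counts =
    Counts(caseAlt + other.caseAlt, caseRef + other.caseRef,
           ctrlAlt + other.ctrlAlt, ctrlRef + other.ctrlRef)

  // Pearson chi-squared statistic for the 2x2 table (no continuity correction).
  def chiSquared: Double = {
    val n = (caseAlt + caseRef + ctrlAlt + ctrlRef).toDouble
    if (n == 0) {
      0.0
    } else {
      val rows = Array((caseAlt + caseRef).toDouble, (ctrlAlt + ctrlRef).toDouble)
      val cols = Array((caseAlt + ctrlAlt).toDouble, (caseRef + ctrlRef).toDouble)
      val obs = Array(Array(caseAlt.toDouble, caseRef.toDouble),
                      Array(ctrlAlt.toDouble, ctrlRef.toDouble))
      (for (i <- 0 to 1; j <- 0 to 1) yield {
        val expected = rows(i) * cols(j) / n
        if (expected == 0.0) 0.0 else math.pow(obs(i)(j) - expected, 2) / expected
      }).sum
    }
  }
}

object AssociationTest {

  // Each observation contributes its allele counts to exactly one per-variant
  // table, so the whole test is a map followed by a reduceByKey.
  def run(observations: RDD[Observation]): RDD[(String, Double)] =
    observations
      .map { o =>
        val alt = o.altAlleleCount.toLong
        val ref = 2L - alt
        val counts =
          if (o.hasTrait) Counts(alt, ref, 0L, 0L) else Counts(0L, 0L, alt, ref)
        (o.variantId, counts)
      }
      .reduceByKey((a, b) => a.merge(b))
      .mapValues(_.chiSquared)
}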
What if two variants are
close to each other?
• This phenomenon is known as linkage disequilibrium: if
two variants are close to each other, they are likely to be
inherited together
Levo and Segal, Nat Rev Gen, 2014.
• We can often link expression changes and trait changes
to blocks of the genome, but not to specific variants
Can we break up
these blocks?
• When a variant modifies a gene, we can predict how the
gene is modified:
• Does this variant change how the protein is spliced?
• Does this variant change an amino acid?
• We can discard variants that do not change proteins
• However, if the variant is outside of a gene, how do we
make sense of it?
• Are these variants important?
Mutations in AML
There is a big “long tail”, including people who have
cancer but no “modified” genes!
Ravi Pandya, BeatAML.
Looking outside
of the Exome
• We analyze mutations
in the exome using the
grammar for protein
creation
• Can we apply a similar
approach outside of
the exome?
• Let’s use the grammar
for regulation instead!
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
Grammar Enhanced
Statistical Models
Levo and Segal, Nat Rev Gen, 2014.
S. Weingarten-Gabbay and E. Segal, Human Genetics, 2014.
You Can Help!
• All of our projects are open source:
• https://www.github.com/bigdatagenomics
• Apache 2 licensed
• For a tutorial, see Chapter 10 of Advanced Analytics with
Spark
• More details on ADAM in our latest paper
• The GA4GH is looking for coders to help enable genomic
data sharing: https://github.com/ga4gh/server
Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey
Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony
Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman,
Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson, Tom White
• Microsoft Research: Ravi Pandya, Bill Bolosky
• UC Santa Cruz: Benedict Paten, David Haussler, Hannes Schmidt, Beau
Norgeot
• And many other open source contributors, especially Michael Heuer, Neil
Ferguson, Andy Petrella, Xavier Tordoir
• Total of 40 contributors to ADAM/BDG from >12 institutions
