2. Data Intensive Genomics
• Scale of genomic analyses is growing rapidly:
• New experiments sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20 PB of
read data and ~300 TB of genotype data
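As a quick sanity check, the totals above can be turned into per-sample figures. This is only back-of-the-envelope arithmetic; decimal units (1 PB = 10^15 bytes) are an assumption, not something the slide states.

```python
# Per-sample sizes implied by the slide's totals.
# Assumption: decimal units (1 PB = 10**15 bytes, 1 TB = 10**12 bytes).
SAMPLES = 100_000
READ_DATA_BYTES = 20 * 10**15        # ~20 PB of read data
GENOTYPE_DATA_BYTES = 300 * 10**12   # ~300 TB of genotype data

reads_per_sample_gb = READ_DATA_BYTES / SAMPLES / 10**9
genotypes_per_sample_gb = GENOTYPE_DATA_BYTES / SAMPLES / 10**9

print(f"read data per sample: ~{reads_per_sample_gb:.0f} GB")          # ~200 GB
print(f"genotype data per sample: ~{genotypes_per_sample_gb:.0f} GB")  # ~3 GB
```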
3. Our building block: ADAM
• ADAM is an open source, high performance,
distributed platform for genomic analysis
• ADAM defines:
1. A data schema and layout on disk*
2. An integration with Spark’s Scala and Java APIs**
3. A command line interface
* Via Parquet and Avro
** Work on Python integration is underway
4. ADAM’s Performance
• Achieves linear scalability out
to 128 nodes for most tasks
• 2-4x improvement over {GATK,
samtools, Picard} on a single
node
5. ADAM: Implementation
• 34k LOC (96% Scala)
• Apache 2 licensed OSS
• 27 contributors across 12 institutions
6. Big Data Genomics
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• PacMin: Long read assembly
• eggo: Datasets
7. Downstream focus:
Genome Resequencing
• Resequencing: sequence a sample, and compute
diff from “average” genome → variants
• We’re working on two approaches:
• avocado: find variants via local reassembly
• PacMin: use long reads to find variants via de
novo assembly
8. The Sequencing Abstraction
It was the best of times, it was the worst of times…
Metaphor borrowed from Michael Schatz
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
9. The Alignment Abstraction
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
10. Sequence Assembly
It was the best of times, it was the worst of times…
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
11. avocado performs efficient
de Bruijn reassembly
Read: ACACTGCACT
3-mers: ACA CAC ACT CTG TGC GCA CAC ACT
Graph:
ACA CAC ACT
CTG TGC GCA
• Some high accuracy variant callers (GATK, Platypus,
Scalpel) reassemble reads aligned at genomic regions
• Typically use a de Bruijn graph: nodes are k-mers, and
edges represent observed transitions between k-mers
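The graph described above can be sketched in a few lines. This is a minimal illustration of the general de Bruijn construction (k-mers as nodes, observed transitions as edges), not avocado's actual implementation; the function names `kmers` and `de_bruijn_edges` are hypothetical.

```python
from collections import Counter

def kmers(read, k):
    """All length-k substrings of a read, in order of appearance."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

def de_bruijn_edges(reads, k):
    """Count each observed transition between consecutive k-mers."""
    edges = Counter()
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            edges[(a, b)] += 1
    return edges

# The read from the slide, cut into 3-mers.
edges = de_bruijn_edges(["ACACTGCACT"], k=3)
nodes = {km for edge in edges for km in edge}

for (a, b), count in sorted(edges.items()):
    print(f"{a} -> {b} (seen {count}x)")
# The transition CAC -> ACT is observed twice, so the graph contains a cycle.
```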
12. Efficient Local Reassembly
ACA CAC ACT
CTG TGC GCA
• Current methods elaborate all paths through the graph, perform O(hn)
realignments at O(lr · lh) cost, and score O(h^2) haplotype pairs
• Instead, identify “bubbles” and emit statistics directly from the graph:
• Eliminate expensive realignment!
• Most variant alleles are provably canonical.
CTT TTC TCA
Reference: CTGA
Bubble: CTTA
h: number of haplotypes (paths), n: number of reads, lr: read length, lh: haplotype length
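One way to picture the bubble idea: walk the reference path through the graph, and whenever a branch leaves the reference and later rejoins it, report the two alleles plus a read-support count taken straight from the edge counts, with no realignment. The sketch below is a simplification under stated assumptions (the reference has no repeated k-mers, and alternate branches do not themselves fork); `find_bubbles`, `spell`, and the example sequences are hypothetical and are not taken from avocado.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def spell(path):
    """Collapse a path of overlapping k-mers back into a sequence."""
    return path[0] + "".join(km[-1] for km in path[1:]) if path else ""

def find_bubbles(reference, reads, k):
    ref_path = kmers(reference, k)
    # Assumption: no k-mer repeats in the reference, so positions are unique.
    ref_index = {km: i for i, km in enumerate(ref_path)}
    succ = defaultdict(Counter)  # k-mer -> Counter of successor k-mers
    for read in reads:
        ks = kmers(read, k)
        for a, b in zip(ks, ks[1:]):
            succ[a][b] += 1

    bubbles = []
    for i, node in enumerate(ref_path):
        for first in sorted(succ[node]):
            if first in ref_index:
                continue  # still on the reference path
            path, support = [first], succ[node][first]
            while True:
                outs = succ[path[-1]]
                if len(outs) != 1:
                    break  # dead end or nested branch: out of scope here
                ((nxt, count),) = outs.items()
                support = min(support, count)
                if nxt in ref_index:
                    j = ref_index[nxt]
                    if j > i:  # branch rejoins downstream: a bubble
                        bubbles.append({
                            "ref": spell(ref_path[i + 1:j]),
                            "alt": spell(path),
                            "alt_support": support,
                        })
                    break
                if nxt in path:
                    break  # cycle off the reference
                path.append(nxt)
    return bubbles

reads = ["ACTGCAC", "ACTGCAC", "ACTTCAC"]  # two reference reads, one variant read
bubbles = find_bubbles("ACTGCAC", reads, k=3)
print(bubbles)
```

Here the alternate branch CTT, TTC, TCA leaves the reference after ACT and rejoins at CAC, so the sketch reports the alleles CTGCA and CTTCA with a support of one read.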
16. Alternate Formulation:
Overlap Graph Assembly
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
It was the
the best of
times, it was
the worst of
worst of times
best of times
was the worst
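The overlap graph formulation can be illustrated with the same sentence fragments: each read is a node, and an edge links read a to read b when a suffix of a equals a prefix of b. The sketch below works on word tokens to match the metaphor and compares all pairs, which is the quadratic step a faster overlapper tries to avoid; `tokens`, `overlap`, and `overlap_graph` are hypothetical names, and lowercasing and dropping commas is an assumption made so that "times" and "times," match.

```python
def tokens(read):
    """Normalize a read into word tokens (lowercased, commas dropped)."""
    return read.lower().replace(",", "").split()

def overlap(a, b, min_len=1):
    """Length, in tokens, of the longest suffix of a that is a prefix of b."""
    ta, tb = tokens(a), tokens(b)
    for n in range(min(len(ta), len(tb)), min_len - 1, -1):
        if ta[-n:] == tb[:n]:
            return n
    return 0

def overlap_graph(reads, min_len=1):
    """All-pairs suffix/prefix comparison: O(n^2) in the number of reads."""
    graph = {}
    for a in reads:
        for b in reads:
            if a is not b:
                n = overlap(a, b, min_len)
                if n:
                    graph[(a, b)] = n
    return graph

reads = [
    "It was the", "the best of", "times, it was", "the worst of",
    "worst of times", "best of times", "was the worst",
]
graph = overlap_graph(reads)
for (a, b), n in sorted(graph.items(), key=lambda item: -item[1]):
    print(f"{a!r} -> {b!r} (overlap {n})")
```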
17. PacMin performs fast
overlapping with MinHashing
• Wonderful realization by Berlin et al.¹: overlapping is
similar to the document similarity problem
• Use MinHashing to approximate similarity:
1: Berlin et al, bioRxiv 2014
Per document/read, compute signature:
1. Cut into shingles
2. Apply random hashes to shingles
3. For each random hash, take the min over all shingles
Hash into buckets:
Signatures of length l can be hashed into b buckets, so we expect
to compare all elements with similarity ≥ (1/b)^(b/l)
Compare:
For two documents with signatures of length l, Jaccard similarity is
estimated by (# equal hashes) / l
• Easy to implement in Spark: map, groupBy, map, filter
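The three steps above can be sketched in plain Python; the same shape would carry over to Spark as map (signatures), groupBy (buckets), map (candidate pairs), filter (similarity threshold). This is an illustrative sketch, not PacMin's code: the hash family (a·x + b mod a prime over CRC32 values), the shingle size, the 0.5 threshold, and all function names are assumptions.

```python
import random
import zlib
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # modulus for the assumed hash family h(x) = (a*x + b) mod PRIME

def shingles(read, k):
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def make_hashes(l, seed=0):
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(l)]

def signature(read, k, hashes):
    """For each random hash, keep the minimum value over all shingles."""
    xs = [zlib.crc32(s.encode()) for s in shingles(read, k)]  # assumes len(read) >= k
    return tuple(min((a * x + b) % PRIME for x in xs) for a, b in hashes)

def candidate_pairs(signatures, bands):
    """Split each signature into bands; reads sharing any band bucket are compared."""
    l = len(next(iter(signatures.values())))
    rows = l // bands  # assumes l is divisible by bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for band in range(bands):
            buckets[(band, sig[band * rows:(band + 1) * rows])].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

def estimate_jaccard(sig_a, sig_b):
    """(# equal hashes) / l"""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

reads = {
    "read_a": "ACGTACGTTGCA",
    "read_b": "ACGTACGTTGCA",  # identical to read_a
    "read_c": "GGGGGGCCCCCC",  # shares no 4-mer with the others
    "read_d": "ACGTACGTTGAA",  # partially similar to read_a
}
hashes = make_hashes(l=16)
sigs = {name: signature(read, 4, hashes) for name, read in reads.items()}  # map
pairs = candidate_pairs(sigs, bands=8)                                     # groupBy + map
overlaps = sorted(p for p in pairs
                  if estimate_jaccard(sigs[p[0]], sigs[p[1]]) >= 0.5)      # filter
print(overlaps)
```

With l = 16 and b = 8, each band holds l/b = 2 hash values, so pairs above the slide's threshold of roughly (1/b)^(b/l) are the ones expected to land in a shared bucket.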
18. Work in progress:
• Benchmarking avocado and PacMin:
• Running on genomes with orthogonal
validation from NIST/Genome In A Bottle and
Harvard Personal Genome Project
• Implementing monoallelic statistical model
19. Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André
Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan
Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams,
Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Ravi Pandya
• And many other open source contributors, especially Michael
Heuer, Neil Ferguson, and Andy Petrella
• Total of 27 contributors to ADAM/BDG from >12 institutions