Fast Variant Calling with ADAM and avocado
Frank Austin Nothaft, UC Berkeley AMPLab
fnothaft@berkeley.edu, @fnothaft
2/19/2015
Data Intensive Genomics
• New population-scale experiments will sequence 10-100k samples
• 100k samples @ 60x WGS will generate ~20PB of read data and ~300TB of genotype data
• End-to-end pipeline latency is important for clinical work
• We want to jointly analyze samples to uncover low-frequency variation
How can we improve analysis productivity?
• Flat file formats sacrifice interoperability but do not improve performance
• Common sort order invariants imposed by tools compromise correctness
• Genomics APIs tend to sit at a low level of abstraction, which compromises productivity
Our building block: ADAM
• ADAM is an open source, high-performance, distributed platform for genomic analysis
• ADAM defines:
1. A data schema and layout on disk*
2. A programming interface for distributed processing of genomic data**
3. A command line interface
* Via Parquet and Avro
** Work on Python integration is underway
ADAM’s guiding principle: use a schema as a narrow waist
The ADAM stack, from top to bottom:
1. Application: variant calling & analysis, RNA-seq analysis, etc. Users define analyses via transformations.
2. Presentation: enriched models (e.g., Enriched Read/Variant) provide convenient methods on common models.
3. Evidence Access: MapReduce/DBMS engines (Spark, Spark SQL, Hadoop) efficiently execute transformations.
4. Schema: data models (Avro schemas for reads, variants, and genotypes) define the logical structure of basic genomic objects.
5. Materialized Data: columnar storage (data loaded from Parquet and legacy formats); common interfaces map the logical schema to bytes on disk.
6. Data Distribution: parallel file systems (HDFS, Tachyon, HPC file systems, S3) coordinate the distribution of data.
7. Physical Storage: attached storage (disk, SSD, block store, memory cache); decoupling storage enables a performance/cost tradeoff.
Data Format
• Schema can be updated without breaking backwards compatibility
• Normalize metadata fields into the schema for O(1) metadata access
• Models are “dumb”; enhance as necessary with rich objects (see the sketch after the schema)
record AlignmentRecord {
  union { null, Contig } contig = null;
  union { null, long } start = null;
  union { null, long } end = null;
  union { null, int } mapq = null;
  union { null, string } readName = null;
  union { null, string } sequence = null;
  union { null, string } mateReference = null;
  union { null, long } mateAlignmentStart = null;
  union { null, string } cigar = null;
  union { null, string } qual = null;
  union { null, string } recordGroupName = null;
  union { int, null } basesTrimmedFromStart = 0;
  union { int, null } basesTrimmedFromEnd = 0;
  union { boolean, null } readPaired = false;
  union { boolean, null } properPair = false;
  union { boolean, null } readMapped = false;
  union { boolean, null } mateMapped = false;
  union { boolean, null } firstOfPair = false;
  union { boolean, null } secondOfPair = false;
  union { boolean, null } failedVendorQualityChecks = false;
  union { boolean, null } duplicateRead = false;
  union { boolean, null } readNegativeStrand = false;
  union { boolean, null } mateNegativeStrand = false;
  union { boolean, null } primaryAlignment = false;
  union { boolean, null } secondaryAlignment = false;
  union { boolean, null } supplementaryAlignment = false;
  union { null, string } mismatchingPositions = null;
  union { null, string } origQual = null;
  union { null, string } attributes = null;
  union { null, string } recordGroupSequencingCenter = null;
  union { null, string } recordGroupDescription = null;
  union { null, long } recordGroupRunDateEpoch = null;
  union { null, string } recordGroupFlowOrder = null;
  union { null, string } recordGroupKeySequence = null;
  union { null, string } recordGroupLibrary = null;
  union { null, int } recordGroupPredictedMedianInsertSize = null;
  union { null, string } recordGroupPlatform = null;
  union { null, string } recordGroupPlatformUnit = null;
  union { null, string } recordGroupSample = null;
  union { null, Contig } mateContig = null;
}
Schemas at https://www.github.com/bigdatagenomics/bdg-formats
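The “dumb model, rich object” split looks roughly like the sketch below: the Avro-generated AlignmentRecord stays a plain data record, and derived values live in a wrapper. The class name mirrors ADAM’s RichAlignmentRecord, but the body here is an illustrative assumption, not ADAM’s code.

import org.bdgenomics.formats.avro.AlignmentRecord

class RichAlignmentRecord(val record: AlignmentRecord) {
  // Parse the CIGAR string lazily, once, into (length, operator) pairs.
  lazy val cigarElements: Seq[(Int, Char)] =
    """(\d+)([MIDNSHP=X])""".r
      .findAllMatchIn(record.getCigar.toString)
      .map(m => (m.group(1).toInt, m.group(2).head))
      .toSeq

  // Reference bases consumed by the alignment (M/D/N/=/X operators).
  lazy val referenceLength: Int =
    cigarElements.collect { case (len, op) if "MDN=X".contains(op) => len }.sum
}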
Parquet
• ASF Incubator project, based on Google Dremel
• High-performance columnar store with support for projections and push-down predicates
• Short read data stored in Parquet achieves a 25% improvement in size over compressed BAM
Image from Parquet format definition: https://www.github.com/apache/incubator-parquet-format
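As a sketch of what projections buy in practice: a load can materialize only the columns an analysis needs. Projection and AlignmentRecordField follow ADAM’s projection API, but treat the exact names, the loadAlignments signature, and the spark-shell context (a SparkContext named sc with ADAM’s implicits in scope) as assumptions.

import org.bdgenomics.adam.projections.{ AlignmentRecordField, Projection }
import org.bdgenomics.adam.rdd.ADAMContext._

// Read only the three columns needed to compute coverage; Parquet never
// deserializes the rest (sequence, qualities, attributes, ...).
val projection = Projection(AlignmentRecordField.contig,
                            AlignmentRecordField.start,
                            AlignmentRecordField.end)
val reads = sc.loadAlignments("sample.adam", projection = Some(projection))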
Backwards Compatibility
• Short reads: compatible with SAM, BAM, and FASTQ
• Convert on read and write
• Working on CRAM support
• The variant, genotype, and variant annotation schemas can convert to/from VCF
• Support a wide variety of genomic annotation formats (e.g., GTF, BED, narrowPeak)
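Conversion on read and write makes round-tripping a legacy format a two-liner; a minimal sketch in the spark-shell, assuming a SparkContext sc (loadAlignments and adamParquetSave follow the 2015-era API, and the exact names may differ; the file paths are illustrative):

import org.bdgenomics.adam.rdd.ADAMContext._

// Load a BAM (records are converted to AlignmentRecord on read) and
// write it back out as ADAM/Parquet.
val reads = sc.loadAlignments("NA12878.bam")
reads.adamParquetSave("NA12878.adam")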
ADAM’s API Design
• ADAM is built on top of Apache Spark, which provides the RDD abstraction (roughly, distributed arrays)
• Common primitives include:
• Aggregates: BQSR, indel realignment
• Bucketing: duplicate marking, concordance
• Region joins: variant calling and filtration (see the sketch below)
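To make the region join primitive concrete, here is a minimal broadcast-based sketch: overlap-join a small table of regions (e.g., candidate sites) against a large RDD of reads. ReferenceRegion mirrors ADAM’s class of the same name, but this simplified implementation is illustrative rather than ADAM’s.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Half-open genomic interval with an overlap test.
case class ReferenceRegion(contig: String, start: Long, end: Long) {
  def overlaps(that: ReferenceRegion): Boolean =
    contig == that.contig && start < that.end && that.start < end
}

// Broadcast the small side to every executor, then stream the large side
// past it, emitting a pair for every overlapping region.
def broadcastRegionJoin[T, U](sc: SparkContext,
                              small: Seq[(ReferenceRegion, T)],
                              large: RDD[(ReferenceRegion, U)]): RDD[(T, U)] = {
  val smallBcast = sc.broadcast(small)
  large.flatMap { case (region, u) =>
    smallBcast.value
      .filter { case (r, _) => r.overlaps(region) }
      .map { case (_, t) => (t, u) }
  }
}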
ADAM’s Performance
• Achieves linear scalability out to 128 nodes for most tasks
• 2-4x improvement over GATK, samtools, and Picard on a single node
Analysis run on Amazon EC2; the single node was hs1.8xlarge, the cluster was m2.4xlarge
Scripts available at https://www.github.com/fnothaft/bdg-recipes.git, “sigmod” branch
BDG: ADAM’s Ecosystem
• ADAM: core API + CLIs
• bdg-formats: data schemas
• RNAdam: RNA analysis on ADAM
• avocado: distributed local assembler
• PacMin: long read assembly
• eggo: datasets
Downstream focus: Genome Resequencing
• We’re working on two approaches:
• avocado: find variants via local reassembly
• PacMin: use long reads to find variants via de novo assembly
• We’ll focus on avocado today
What are the challenges?
• For accurate INDEL discovery, we want to reassemble reads around variants, but reassembly is expensive
• We need to statistically integrate over a large collection of samples to discover low-frequency variants
• The reference genome is not always representative
avocado performs efficient de Bruijn reassembly
[Figure: the sequence ACACTGCACT is cut into its 3-mers (ACA, CAC, ACT, CTG, TGC, GCA, CAC, ACT), which become the nodes of a de Bruijn graph linking ACA, CAC, and ACT through CTG, TGC, and GCA.]
• Several high-accuracy variant callers (GATK, Platypus, Scalpel) reassemble reads aligned at genomic regions
• Typically use a de Bruijn graph: nodes are k-mers, and edges represent observed transitions between k-mers (see the sketch below)
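A minimal sketch of that construction (illustrative, not avocado’s implementation; it assumes plain string reads and ignores reverse complements):

// Build a de Bruijn graph: nodes are k-mers, and a directed edge connects
// two k-mers whenever they appear consecutively in a read.
def deBruijn(reads: Seq[String], k: Int): Map[String, Set[String]] = {
  val edges = for {
    read <- reads
    i <- 0 to read.length - k - 1
  } yield (read.substring(i, i + k), read.substring(i + 1, i + k + 1))
  edges.groupBy(_._1).mapValues(_.map(_._2).toSet).toMap
}

// Example: deBruijn(Seq("ACACTGCACT"), 3) contains exactly the edges from
// the figure above: ACA -> CAC, CAC -> ACT, ACT -> CTG, ..., GCA -> CAC.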
Efficient Local Reassembly
• Current methods elaborate all paths through the graph, perform O(hn) realignments at O(l_r l_h) cost each, and score O(h²) haplotype pairs
• Instead, identify “bubbles” and emit statistics directly from the graph (see the sketch below):
• Eliminate expensive realignment!
• Variant alleles are provably canonical.
[Figure: a bubble in the graph. Both paths share the flanking k-mers ACA, CAC, and ACT; the reference path runs through CTG, TGC, GCA and the alternate path through CTT, TTC, TCA. Reference allele: CTGA; bubble allele: CTTA.]
h: number of haplotypes (paths), n: number of reads, l_r: read length, l_h: haplotype length
Proofs that alleles are canonical are too long for slides; will gladly share offline.
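A minimal sketch of bubble detection on the Map-based graph from the previous sketch (assumptions: an acyclic graph containing simple bubbles only, i.e., each arm is a non-branching path):

// From each branch node (out-degree 2), follow both arms until they hit the
// reconvergence point (a node with in-degree > 1), and emit the arm pair.
def bubbles(graph: Map[String, Set[String]]): Seq[(String, List[String], List[String])] = {
  // In-degree of each node, used to detect where the two arms reconverge.
  val inDegree = graph.values.flatten.toSeq.groupBy(identity).mapValues(_.size).toMap

  def arm(node: String): List[String] =
    if (inDegree.getOrElse(node, 0) > 1) List(node) // reconvergence point
    else graph.getOrElse(node, Set.empty).toList match {
      case next :: Nil => node :: arm(next)
      case _           => List(node) // dead end or unexpected branch
    }

  graph.toSeq.collect { case (node, succs) if succs.size == 2 =>
    val Seq(a, b) = succs.toSeq
    (node, arm(a), arm(b))
  }
}

Concatenating the k-mers along each arm spells out the reference and alternate alleles of the bubble.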
Genotyping
• Use a sliding-window traversal of the genome to bucket sites (see the sketch below)
• Currently use a model based on the samtools mpileup genotype likelihood and EM algorithms
• Moving to a monoallelic “allele graph” model
[Figure: a pileup of five reads in one window; the columns expose the candidate alleles, including a CA insertion, a T/C SNP, and a TT insertion.]
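A minimal sketch of the windowing step (illustrative, not avocado’s code): key each read by every fixed-width window its alignment overlaps, then group per window.

import org.apache.spark.SparkContext._ // pair-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

// A fixed-width genomic window, identified by contig and window index.
case class Window(contig: String, index: Long)

// All windows overlapped by the half-open interval [start, end).
def windowsOf(contig: String, start: Long, end: Long, width: Long): Seq[Window] =
  ((start / width) to ((end - 1) / width)).map(i => Window(contig, i))

// Bucket reads, given here as (contig, start, end) triples, by window.
def bucketReads(reads: RDD[(String, Long, Long)],
                width: Long): RDD[(Window, Iterable[(String, Long, Long)])] =
  reads.flatMap { case r @ (contig, start, end) =>
    windowsOf(contig, start, end, width).map(w => (w, r))
  }.groupByKey()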
Allele Graphs
• Edges of the graph define conditional probabilities
[Equation: the conditional probability of allele copy number X given the preceding allele’s copy number Y, expressed in terms of k, j, and the P_i; symbols are defined in the notes below.]
• Can efficiently marginalize probabilities over the graph via belief propagation, and exactly solve for the argmax (see the sketch below)
[Figure: an allele graph. Reference segments ACACTCG, TCTCA, and TCCACACT are linked through branch points carrying alternate alleles (C/A and G/C).]
Notes:
X = copy number of this allele
Y = copy number of the preceding allele
k = number of reads observed
j = number of reads supporting the Y → X transition
P_i = probability that read i supports the Y → X transition
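avocado’s model runs belief propagation on the general allele graph; as a concrete illustration of the “exactly solve for argmax” step, here is a minimal max-product dynamic program restricted to a chain-shaped graph. The emit and trans log-score functions stand in for the per-site and per-edge statistics above and are hypothetical inputs.

// Exact argmax over a chain: max-product belief propagation reduces to a
// Viterbi-style dynamic program when the graph has no branches.
//   emit(s, j):     log-score of state j (e.g., a copy number) at site s
//   trans(s, i, j): log-score of moving from state i at site s to state j
def argmaxChain(nSites: Int, nStates: Int,
                emit: (Int, Int) => Double,
                trans: (Int, Int, Int) => Double): Seq[Int] = {
  val score = Array.ofDim[Double](nSites, nStates)
  val back = Array.ofDim[Int](nSites, nStates)
  for (j <- 0 until nStates) score(0)(j) = emit(0, j)
  for (s <- 1 until nSites; j <- 0 until nStates) {
    val (best, arg) = (0 until nStates)
      .map(i => (score(s - 1)(i) + trans(s - 1, i, j), i))
      .max
    score(s)(j) = best + emit(s, j)
    back(s)(j) = arg
  }
  // Trace the best path back from the final site.
  var path = List((0 until nStates).maxBy(j => score(nSites - 1)(j)))
  for (s <- nSites - 1 until 0 by -1) path = back(s)(path.head) :: path
  path
}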
Future Work
• When integrating over samples, we should cluster samples by similarity
• Working on “multi-region” assembly; will integrate alt references and “similar regions”
• Performance and accuracy evaluation on the Illumina Platinum pedigree and 1000 Genomes
Acknowledgements
• UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
• GenomeBridge: Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Ravi Pandya
• UC Santa Cruz: Benedict Paten, David Haussler
• And many other open source contributors, especially Michael Heuer, Neil Ferguson, Andy Petrella, Xavier Tordoir
• Total of 27 contributors to ADAM/BDG from >12 institutions