Design for Scalability in ADAM

Frank Austin Nothaft
UC Berkeley

What is ADAM?
• An open source, high performance, distributed
platform for genomic analysis
• ADAM defines:
1. A data schema and layout on disk*
2. A Scala API
3. A command line interface
* Via Avro and Parquet

What’s the big picture?
• ADAM: Core API + CLIs
• bdg-formats: Data schemas
• RNAdam: RNA analysis on ADAM
• avocado: Distributed local assembler
• Guacamole: Distributed somatic caller
• xASSEMBLEx: GraphX-based de novo assembler
• bdg-services: ADAM clusters

Implementation Overview
• 27k LOC (99% Scala)
• Apache 2 licensed OSS
• 21 contributors across 8 institutions
• Pushing for production 1.0 release towards end of year

Key Observations
• Current genomics pipelines are I/O limited
• Most genomics algorithms can be formulated as a
data or graph parallel computation
• These algorithms are heavy on iteration/pipelining
• Data access pattern is write once, read many times
• High-coverage whole-genome sequencing will become the main
sequencing target (for human genetics)

Principles for Scalable
Design in ADAM
• Parallel FS and data representation (HDFS +
Parquet) combined with in-memory computing
eliminates disk bandwidth bottleneck
• Spark allows efficient implementation of iterative/
pipelined Map-Reduce
• Minimize data movement: send code to data

Data Format
• Avro schema encoded by
Parquet
• Schema can be updated
without breaking backwards
compatibility
• Read schema looks a lot like
BAM, but renormalized
• Actively removing tags
• Variant schema is strictly
biallelic, a “cell in the matrix”
record AlignmentRecord {
union { null, Contig } contig = null;
union { null, long } start = null;
union { null, long } end = null;
union { null, int } mapq = null;
union { null, string } readName = null;
union { null, string } sequence = null;
union { null, string } mateReference = null;
union { null, long } mateAlignmentStart = null;
union { null, string } cigar = null;
union { null, string } qual = null;
union { null, string } recordGroupName = null;
union { int, null } basesTrimmedFromStart = 0;
union { int, null } basesTrimmedFromEnd = 0;
union { boolean, null } readPaired = false;
union { boolean, null } properPair = false;
union { boolean, null } readMapped = false;
union { boolean, null } mateMapped = false;
union { boolean, null } firstOfPair = false;
union { boolean, null } secondOfPair = false;
union { boolean, null } failedVendorQualityChecks = false;
union { boolean, null } duplicateRead = false;
union { boolean, null } readNegativeStrand = false;
union { boolean, null } mateNegativeStrand = false;
union { boolean, null } primaryAlignment = false;
union { boolean, null } secondaryAlignment = false;
union { boolean, null } supplementaryAlignment = false;
union { null, string } mismatchingPositions = null;
union { null, string } origQual = null;
union { null, string } attributes = null;
union { null, string } recordGroupSequencingCenter = null;
union { null, string } recordGroupDescription = null;
union { null, long } recordGroupRunDateEpoch = null;
union { null, string } recordGroupFlowOrder = null;
union { null, string } recordGroupKeySequence = null;
union { null, string } recordGroupLibrary = null;
union { null, int } recordGroupPredictedMedianInsertSize = null;
union { null, string } recordGroupPlatform = null;
union { null, string } recordGroupPlatformUnit = null;
union { null, string } recordGroupSample = null;
union { null, Contig } mateContig = null;
}
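
A minimal sketch of why nullable fields with defaults keep the schema backwards compatible. Avro's schema resolution fills a field that is missing from older data with the default declared in the reader's schema, and ignores written fields the reader does not declare. The Python below imitates that rule with plain dictionaries; it is not the Avro library, `READER_DEFAULTS` and `resolve` are hypothetical names, and only a handful of the fields above are covered.

```python
# Illustrative only: imitates Avro's reader/writer schema resolution with dicts.
# READER_DEFAULTS and resolve() are hypothetical names, not part of Avro or ADAM.

# Defaults copied from a few fields of the AlignmentRecord schema above.
READER_DEFAULTS = {
    "start": None,
    "end": None,
    "mapq": None,
    "readMapped": False,
    "duplicateRead": False,
    "supplementaryAlignment": False,
}


def resolve(written_record, reader_defaults=READER_DEFAULTS):
    """Return the record as a reader with `reader_defaults` would see it.

    Fields absent from the written data take the reader's default;
    fields unknown to the reader are dropped.
    """
    return {
        name: written_record.get(name, default)
        for name, default in reader_defaults.items()
    }


# A record assumed to be written before `supplementaryAlignment` existed,
# carrying a made-up field ("legacyTag") that the reader does not declare.
old_record = {"start": 100, "end": 200, "mapq": 60, "readMapped": True, "legacyTag": "x"}
resolved = resolve(old_record)
print(resolved)
```

The old record still loads: the newer boolean flags come back as `False` and the unknown field is skipped, so adding or removing optional fields does not break existing readers.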

Parquet
• ASF Incubator project, based on
Google Dremel
• http://www.parquet.io
• High performance columnar
store with support for projections
and push-down predicates
• 3 layers of parallelism:
• File/row group
• Column chunk
• Page
Image from Parquet format definition: https://github.com/Parquet/parquet-format

Filtering
• Parquet provides predicate pushdown
• Evaluate filter on a subset of columns
• Only read full set of projected columns for passing records
• Full primary/secondary indexing support in Parquet 2.0
• Very efficient if reading a small set of columns:
• On disk, contig ID/start/end consume < 2% of space
Image from Parquet format definition: https://github.com/Parquet/parquet-format
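
The two-step filtering described above can be sketched with a toy column store. This is not the Parquet API: the `columns` layout, the `overlaps` and `scan` helpers, the sample reads, and the half-open interval convention are all assumptions made for illustration.

```python
# Illustrative only: a toy column store, not the Parquet API.
# All names (columns, overlaps, scan) and the sample data are hypothetical.

# Each column is stored separately, as in a columnar file.
columns = {
    "contig":   ["chr1", "chr1", "chr2", "chr1"],
    "start":    [100, 5000, 150, 180],
    "end":      [200, 5100, 250, 280],
    "sequence": ["ACGT", "GGCA", "TTAG", "CCGA"],
    "qual":     ["IIII", "HHHH", "GGGG", "FFFF"],
}


def overlaps(contig, start, end, region):
    """True if [start, end) on contig overlaps region (half-open is assumed)."""
    r_contig, r_start, r_end = region
    return contig == r_contig and start < r_end and end > r_start


def scan(columns, region, projection):
    """Filter on contig/start/end, then read projected columns for passing rows only."""
    # Step 1: evaluate the predicate using only the three small filter columns.
    passing = [
        i
        for i, (c, s, e) in enumerate(
            zip(columns["contig"], columns["start"], columns["end"])
        )
        if overlaps(c, s, e, region)
    ]
    # Step 2: materialize the projected columns just for the passing row indices.
    return [{name: columns[name][i] for name in projection} for i in passing]


rows = scan(columns, region=("chr1", 150, 300), projection=["start", "sequence"])
print(rows)
```

The `qual` column is never touched, and `sequence` is read only for the two reads that pass the filter, which is where the savings come from when the filter columns are a small fraction of the data.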

Compression
• Parquet compresses at the column level:
• RLE for repetitive columns
• Dictionary encoding for quantized columns
• ADAM uses a fully denormalized schema
• Repetitive columns are run-length encoded away
• Delta encoding (Parquet 2.0) will aid with quality scores
• ADAM is 5-25% smaller than compressed BAM
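
A minimal sketch of the two column encodings named above, to show why a denormalized schema costs little on disk. These are toy versions, not Parquet's on-disk format; the function names and the sample column values are hypothetical.

```python
# Illustrative only: toy versions of the two encodings, not Parquet's on-disk format.
from itertools import groupby


def rle_encode(values):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = []
    index_of = {}
    indices = []
    for value in values:
        if value not in index_of:
            index_of[value] = len(dictionary)
            dictionary.append(value)
        indices.append(index_of[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Look each index up in the dictionary to recover the original values."""
    return [dictionary[i] for i in indices]


# A denormalized column repeats the same record group sample on every read,
# so run-length encoding stores it once per run.
record_group = ["sampleA"] * 5 + ["sampleB"] * 3
encoded_rg = rle_encode(record_group)

# A column with few distinct values (here, made-up mapping qualities)
# becomes a short dictionary plus a list of small indices.
mapq = [60, 60, 0, 29, 60, 0]
mapq_dictionary, mapq_indices = dictionary_encode(mapq)
print(encoded_rg, mapq_dictionary, mapq_indices)
```

Eight repeated strings collapse to two pairs, and both encodings round-trip losslessly through their decode functions.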

Parquet/Spark Integration
• 1 row group in Parquet maps
to 1 partition in Spark
• We interact with Parquet via
input/output formats
• These apply projections
and predicates, handle
(de)compression
• Spark builds and executes a
computation DAG, manages
data locality, errors/retries, etc.
Diagram: Parquet row groups (RG 1, RG 2, … RG n) → Parquet Input Format → Spark partitions (Partition 1, Partition 2, … Partition n) → Parquet Output Format → Parquet row groups (RG 1, RG 2, … RG n)

Compatibility
• Maintain full import/export compatibility with SAM/
BAM, VCF/BCF
• Can use non-ADAM tools in pipeline:*
* Via avocado: https://www.github.com/bigdatagenomics/avocado
Diagram: row groups (RG 1, RG 2, … RG n) → partitions (Part. 1, Part. 2, … Part. n) → repartition → one pipe per chromosome (Chr. 1, Chr. 2, … Chr. M) → repartition → partitions (Part. 1, Part. 2, … Part. n) → row groups (RG 1, RG 2, … RG n)
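
The repartition-and-pipe flow can be sketched in a single process. This is an assumption-laden illustration, not how avocado or Spark implement it: the external tool is a stand-in that upper-cases its input, and `TOOL`, `repartition_by_contig`, and `pipe` are hypothetical names.

```python
# Illustrative only: imitates "repartition by chromosome, then pipe" in one process.
# The external tool is a stand-in; all names here are hypothetical.
import subprocess
import sys
from collections import defaultdict

# Stand-in for a non-ADAM tool: reads lines on stdin, writes upper-cased lines to stdout.
TOOL = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]


def repartition_by_contig(partitions):
    """Regroup records so that every record for a contig lands in one partition."""
    by_contig = defaultdict(list)
    for partition in partitions:
        for contig, payload in partition:
            by_contig[contig].append(payload)
    return dict(by_contig)


def pipe(lines, command):
    """Feed lines to an external command over stdin and collect its stdout lines."""
    result = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


# Input partitions mix contigs, as partitions derived from row groups may.
partitions = [
    [("chr1", "acgt"), ("chr2", "ttag")],
    [("chr1", "ggca"), ("chr2", "ccga")],
]
piped = {
    contig: pipe(lines, TOOL)
    for contig, lines in repartition_by_contig(partitions).items()
}
print(piped)
```

Each chromosome's records reach the external tool as one stream, so a tool that expects a whole chromosome at a time can run unchanged; the output would then be repartitioned before being written back.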

“Cloud” Optimizations
• Emerging use case (?): processing data on public
cloud provider machines, data stored in block store
• E.g., Amazon EMR + S3
• We are optimizing Parquet for S3/other block
stores:
• Compact primary indices for slice lookup
• Eliminate Parquet requirement on HDFS

Spark
• An in-memory data-parallel computing framework
• Optimized for iterative jobs, unlike Hadoop
• Data maintained in memory unless inter-node
movement needed (e.g., on repartitioning)
• Presents a functional programming API, along with
support for interactive programming via REPL
• Used at scale on clusters with >2k nodes, 4TB
datasets

Why Spark?
• Current leading map-reduce framework:
• First in-memory map-reduce platform
• Used at scale in industry, supported in major distros (Cloudera,
HortonWorks, MapR)
• The API:
• Fully functional API
• Main API in Scala, also supports Java, Python, R
• Manages node/job failures via lineage, data locality/job assignment
• Downstream tools (GraphX, MLlib)
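
A toy model of failure recovery via lineage: each dataset remembers the transformation that produced it, so a partition lost from memory can be recomputed from its parent instead of being restored from a replica. This is a sketch under simplifying assumptions, not Spark's implementation; the `Dataset` class and its methods are hypothetical.

```python
# Illustrative only: a toy model of lineage-based recovery, not Spark's implementation.
# The Dataset class and its methods are hypothetical.


class Dataset:
    """A partitioned dataset that remembers how each partition was derived."""

    def __init__(self, num_partitions, compute):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of values
        self._cache = {}          # in-memory partitions, which may be lost
        self.recomputations = 0   # counts how often a partition was (re)built

    @classmethod
    def from_source(cls, partitions):
        """Create a dataset backed by durable input (e.g., files on disk)."""
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        """Record the transformation; nothing is computed until get() is called."""
        return Dataset(self.num_partitions, lambda i: [fn(x) for x in self.get(i)])

    def get(self, i):
        """Return partition i, replaying its lineage if it is not in memory."""
        if i not in self._cache:
            self.recomputations += 1
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose(self, i):
        """Simulate a node failure that drops partition i from memory."""
        self._cache.pop(i, None)


source = Dataset.from_source([[1, 2], [3, 4]])
squared = source.map(lambda x: x * x)
before = [squared.get(i) for i in range(squared.num_partitions)]
squared.lose(1)                 # partition 1 is lost with its node
recovered = squared.get(1)      # rebuilt from the parent via lineage
print(before, recovered, squared.recomputations)
```

Only the lost partition is rebuilt; partition 0 stays in memory untouched.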

Cluster Setups
• Spark is optimized for Hadoop, but is being run on
traditional HPC clusters (e.g., LBNL, Janelia Farm)
• Tachyon file system cache can be used as a
high performance layer between Spark and
HPC file systems
• At Berkeley, we normally run on cloud vendors
• Performance is ~4x better on bare metal

Acknowledgements
• UC Berkeley: Matt Massie, André Schumacher,
Jey Kottalam, Christos Kozanitis
• Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael
Linderman, Jeff Hammerbacher
• GenomeBridge: Timothy Danford, Carl Yeksigian
• Cloudera: Uri Laserson
• Microsoft Research: Jeremy Elson, Ravi Pandya
• And many other open source contributors: 21
contributors to ADAM/BDG from >8 institutions

Acknowledgements
This research is supported in part by NSF CISE
Expeditions Award CCF-1139158, LBNL Award
7076018, DARPA XData Award FA8750-12-2-0331,
and gifts from Amazon Web Services, Google,
SAP, The Thomas and Stacey Siebel Foundation,
Apple, Inc., C3Energy, Cisco, Cloudera, EMC,
Ericsson, Facebook, GameOnTalis, Guavus, HP,
Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk,
Virdata, VMware, WANdisco and Yahoo!.
