Design for Scalability in ADAM

Published in: Engineering

  1. Design for Scalability in ADAM
     Frank Austin Nothaft, UC Berkeley
  2. What is ADAM?
     • An open source, high performance, distributed platform for genomic analysis
     • ADAM defines:
       1. A data schema and layout on disk*
       2. A Scala API
       3. A command line interface
     * Via Avro and Parquet
  3. What’s the big picture?
     • ADAM: Core API + CLIs
     • bdg-formats: Data schemas
     • RNAdam: RNA analysis on ADAM
     • avocado: Distributed local assembler
     • Guacamole: Distributed somatic caller
     • xASSEMBLEx: GraphX-based de novo assembler
     • bdg-services: ADAM clusters
  4. Implementation Overview
     • 27k LOC (99% Scala)
     • Apache 2 licensed OSS
     • 21 contributors across 8 institutions
     • Pushing for a production 1.0 release towards the end of the year
  5. Key Observations
     • Current genomics pipelines are I/O limited
     • Most genomics algorithms can be formulated as a data-parallel or graph-parallel computation
     • These algorithms are heavy on iteration/pipelining
     • Data access pattern is write once, read many times
     • High-coverage, whole-genome sequencing will become the main sequencing target (for human genetics)
  6. Principles for Scalable Design in ADAM
     • A parallel FS and data representation (HDFS + Parquet), combined with in-memory computing, eliminates the disk bandwidth bottleneck
     • Spark allows efficient implementation of iterative/pipelined Map-Reduce
     • Minimize data movement: send code to data
  7. Data Format
     • Avro schema encoded by Parquet
     • Schema can be updated without breaking backwards compatibility
     • Read schema looks a lot like BAM, but renormalized
     • Actively removing tags
     • Variant schema is strictly biallelic, a “cell in the matrix”

     record AlignmentRecord {
       union { null, Contig } contig = null;
       union { null, long } start = null;
       union { null, long } end = null;
       union { null, int } mapq = null;
       union { null, string } readName = null;
       union { null, string } sequence = null;
       union { null, string } mateReference = null;
       union { null, long } mateAlignmentStart = null;
       union { null, string } cigar = null;
       union { null, string } qual = null;
       union { null, string } recordGroupName = null;
       union { int, null } basesTrimmedFromStart = 0;
       union { int, null } basesTrimmedFromEnd = 0;
       union { boolean, null } readPaired = false;
       union { boolean, null } properPair = false;
       union { boolean, null } readMapped = false;
       union { boolean, null } mateMapped = false;
       union { boolean, null } firstOfPair = false;
       union { boolean, null } secondOfPair = false;
       union { boolean, null } failedVendorQualityChecks = false;
       union { boolean, null } duplicateRead = false;
       union { boolean, null } readNegativeStrand = false;
       union { boolean, null } mateNegativeStrand = false;
       union { boolean, null } primaryAlignment = false;
       union { boolean, null } secondaryAlignment = false;
       union { boolean, null } supplementaryAlignment = false;
       union { null, string } mismatchingPositions = null;
       union { null, string } origQual = null;
       union { null, string } attributes = null;
       union { null, string } recordGroupSequencingCenter = null;
       union { null, string } recordGroupDescription = null;
       union { null, long } recordGroupRunDateEpoch = null;
       union { null, string } recordGroupFlowOrder = null;
       union { null, string } recordGroupKeySequence = null;
       union { null, string } recordGroupLibrary = null;
       union { null, int } recordGroupPredictedMedianInsertSize = null;
       union { null, string } recordGroupPlatform = null;
       union { null, string } recordGroupPlatformUnit = null;
       union { null, string } recordGroupSample = null;
       union { null, Contig } mateContig = null;
     }
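The boolean fields in the AlignmentRecord schema correspond to information that SAM/BAM packs into a single integer FLAG. As a rough illustration of that mapping, the Python sketch below unpacks a FLAG value into the field names from the schema. The bit values come from the SAM specification; the function name `flag_to_fields` is hypothetical, and the rules used for `mateMapped` and `primaryAlignment` are assumptions of this sketch, not necessarily how ADAM's converter populates them.

```python
# Standard SAM/BAM FLAG bits, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
REVERSE = 0x10
MATE_REVERSE = 0x20
FIRST_OF_PAIR = 0x40
SECOND_OF_PAIR = 0x80
SECONDARY = 0x100
FAILED_QC = 0x200
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def flag_to_fields(flag: int) -> dict:
    """Unpack a SAM FLAG integer into AlignmentRecord-style booleans.

    Hypothetical helper for illustration only.
    """
    def has(bit: int) -> bool:
        return bool(flag & bit)

    paired = has(PAIRED)
    return {
        "readPaired": paired,
        "properPair": has(PROPER_PAIR),
        "readMapped": not has(UNMAPPED),
        # Assumption: a mate can only count as mapped if the read is paired.
        "mateMapped": paired and not has(MATE_UNMAPPED),
        "firstOfPair": has(FIRST_OF_PAIR),
        "secondOfPair": has(SECOND_OF_PAIR),
        "failedVendorQualityChecks": has(FAILED_QC),
        "duplicateRead": has(DUPLICATE),
        "readNegativeStrand": has(REVERSE),
        "mateNegativeStrand": has(MATE_REVERSE),
        # Assumption: "primary" means neither secondary nor supplementary.
        "primaryAlignment": not has(SECONDARY) and not has(SUPPLEMENTARY),
        "secondaryAlignment": has(SECONDARY),
        "supplementaryAlignment": has(SUPPLEMENTARY),
    }


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: paired, proper pair,
# mate on the reverse strand, first read of the pair.
fields = flag_to_fields(99)
```

Storing each flag as its own column is also what lets a columnar store compress and filter on them individually.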
  8. Parquet
     • ASF Incubator project, based on Google Dremel
     • http://www.parquet.io
     • High performance columnar store with support for projections and push-down predicates
     • 3 layers of parallelism:
       • File/row group
       • Column chunk
       • Page
     Image from Parquet format definition: https://github.com/Parquet/parquet-format
  9. Filtering
     • Parquet provides predicate pushdown
       • Evaluate filter on a subset of columns
       • Only read full set of projected columns for passing records
     • Full primary/secondary indexing support in Parquet 2.0
     • Very efficient if reading a small set of columns:
       • On disk, contig ID/start/end consume < 2% of space
     Image from Parquet format definition: https://github.com/Parquet/parquet-format
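The two-phase read described on the Filtering slide (evaluate the filter on a few columns, then read the projected columns only for passing records) can be sketched in plain Python. This is a toy model of the idea, assuming columns held as in-memory lists; it does not use the Parquet API, and the names `scan` and `overlaps_region` are hypothetical.

```python
from typing import Callable, Dict, List, Sequence


def scan(columns: Dict[str, list],
         filter_columns: Sequence[str],
         predicate: Callable[[dict], bool],
         projection: Sequence[str]) -> List[dict]:
    """Toy columnar scan with a pushed-down predicate (illustration only)."""
    n_rows = len(next(iter(columns.values())))
    # Phase 1: touch only the (small) filter columns to find passing rows.
    passing = [
        i for i in range(n_rows)
        if predicate({c: columns[c][i] for c in filter_columns})
    ]
    # Phase 2: materialize the projected columns for passing rows only.
    return [{c: columns[c][i] for c in projection} for i in passing]


# A tiny column-oriented table of reads (made-up values).
reads = {
    "contig": ["chr1", "chr1", "chr2", "chr1"],
    "start": [100, 5000, 150, 180],
    "end": [200, 5100, 250, 280],
    "sequence": ["ACGT", "TTGA", "GGCC", "CATG"],
    "qual": ["IIII", "FFFF", "HHHH", "GGGG"],
}


def overlaps_region(row: dict) -> bool:
    # Keep reads overlapping chr1:150-300.
    return row["contig"] == "chr1" and row["start"] < 300 and row["end"] > 150


hits = scan(reads, ["contig", "start", "end"], overlaps_region,
            ["start", "sequence"])
```

In this sketch the `qual` column is never read at all, and `sequence` is read only for the two rows that pass the region filter.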
  10. Compression
     • Parquet compresses at the column level:
       • RLE for repetitive columns
       • Dictionary encoding for quantized columns
     • ADAM uses a fully denormalized schema
       • Repetitive columns are RLE’d out
       • Delta encoding (Parquet 2.0) will aid with quality scores
     • ADAM is 5-25% smaller than compressed BAM
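To make the two column encodings named on the Compression slide concrete, here is a minimal Python sketch of run-length encoding and dictionary encoding. It is a simplified illustration: Parquet's real encodings (for example its RLE/bit-packing hybrid) are more involved, and the function names are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple


def rle_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs: List[Tuple[object, int]]) -> list:
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in runs for _ in range(length)]


def dict_encode(values: list) -> Tuple[list, List[int]]:
    """Replace each value with a small integer code into a dictionary."""
    dictionary: list = []
    index: dict = {}
    codes: List[int] = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


# A denormalized column repeats the same value on every row, so it
# collapses to a handful of runs.
sample_column = ["sample1"] * 4 + ["sample2"] * 2
runs = rle_encode(sample_column)

# A quantized column (few distinct values) maps to small integer codes.
mapq_column = [60, 0, 60, 30, 60]
dictionary, codes = dict_encode(mapq_column)
```

This is why a fully denormalized schema costs little on disk: the repeated record-group values end up as a few runs rather than one value per read.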
  11. Parquet/Spark Integration
     • 1 row group in Parquet maps to 1 partition in Spark
     • We interact with Parquet via input/output formats
       • These apply projections and predicates, handle (de)compression
     • Spark builds and executes a computation DAG, manages data locality, errors/retries, etc.
     Diagram: Parquet row groups (RG 1, RG 2, … RG n) → Parquet Input Format → Spark partitions (Partition 1, Partition 2, … Partition n) → Parquet Output Format → Parquet row groups (RG 1, RG 2, … RG n)
  12. Compatibility
     • Maintain full import/export compatibility with SAM/BAM, VCF/BCF
     • Can use non-ADAM tools in pipeline*
     * Via avocado: https://www.github.com/bigdatagenomics/avocado
     Diagram: row groups (RG 1, RG 2, … RG n) → partitions (Part. 1, Part. 2, … Part. n) → Repartition → Chr. 1 into Pipe, Chr. 2 into Pipe, … Chr. M into Pipe → Repartition → partitions (Part. 1, Part. 2, … Part. n) → row groups (RG 1, RG 2, … RG n)
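The pipeline on the Compatibility slide (repartition by chromosome, then stream each chromosome through an external tool's pipe) can be sketched with the Python standard library. This is only an illustration of the pattern, not avocado's implementation: records are simplified to `(contig, read_name)` pairs, the tab-separated wire format is an assumption, and a small Python one-liner that sorts its input stands in for a real non-ADAM tool.

```python
import subprocess
import sys
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (contig, read name); simplified for illustration


def repartition_by_contig(partitions: List[List[Record]]) -> Dict[str, List[Record]]:
    """Regroup records so that each group holds exactly one chromosome."""
    by_contig: Dict[str, List[Record]] = defaultdict(list)
    for partition in partitions:
        for contig, name in partition:
            by_contig[contig].append((contig, name))
    return dict(by_contig)


def pipe_through(command: List[str], records: List[Record]) -> List[Record]:
    """Write records to the tool's stdin and parse records from its stdout."""
    text = "".join(f"{contig}\t{name}\n" for contig, name in records)
    done = subprocess.run(command, input=text, capture_output=True,
                          text=True, check=True)
    result = []
    for line in done.stdout.splitlines():
        contig, name = line.split("\t")
        result.append((contig, name))
    return result


# Stand-in for an external tool: sorts the lines it receives on stdin.
SORT_TOOL = [sys.executable, "-c",
             "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

partitions = [
    [("chr1", "r3"), ("chr2", "r1")],
    [("chr1", "r1"), ("chr2", "r0")],
]
grouped = repartition_by_contig(partitions)
piped = {contig: pipe_through(SORT_TOOL, records)
         for contig, records in grouped.items()}
```

In a cluster setting each chromosome's pipe would run on a different worker; here they simply run one after another.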
  13. “Cloud” Optimizations
     • Emerging use case (?): processing data on public cloud provider machines, data stored in block store
       • E.g., Amazon EMR + S3
     • We are optimizing Parquet for S3/other block stores:
       • Compact primary indices for slice lookup
       • Eliminate Parquet requirement on HDFS
  14. Spark
     • An in-memory data-parallel computing framework
     • Optimized for iterative jobs (unlike Hadoop)
     • Data maintained in memory unless inter-node movement needed (e.g., on repartitioning)
     • Presents a functional programming API, along with support for iterative programming via REPL
     • Used at scale on clusters with >2k nodes, 4TB datasets
  15. Why Spark?
     • Current leading map-reduce framework:
       • First in-memory map-reduce platform
       • Used at scale in industry, supported in major distros (Cloudera, HortonWorks, MapR)
     • The API:
       • Fully functional API
       • Main API in Scala, also supports Java, Python, R
     • Manages node/job failures via lineage, data locality/job assignment
     • Downstream tools (GraphX, MLlib)
  16. Cluster Setups
     • Spark is optimized for Hadoop, but is being run on traditional HPC clusters (e.g., LBNL, Janelia Farm)
     • Tachyon file system cache can be used as a high performance layer between Spark and HPC file systems
     • At Berkeley, we normally run on cloud vendors
     • Performance is ~4x better on bare metal
  17. Acknowledgements
     • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis
     • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
     • GenomeBridge: Timothy Danford, Carl Yeksigian
     • Cloudera: Uri Laserson
     • Microsoft Research: Jeremy Elson, Ravi Pandya
     • And many other open source contributors: 21 contributors to ADAM/BDG from >8 institutions
  18. Acknowledgements
     This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco and Yahoo!.