
Rethinking Data-Intensive Science Using Scalable Analytics Systems


Presentation from SIGMOD 2015. With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson. Paper at http://dl.acm.org/citation.cfm?id=2742787.



  1. Rethinking Data-Intensive Science Using Scalable Analytics Systems
     Frank Austin Nothaft, UC Berkeley AMP/ASPIRE Lab, @fnothaft
     With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
  2. Scientific revolutions are driven by data acquisition revolutions
  3. Genome Sequencing
     2014: ~230,000 genomes sequenced
     At 15-250 GB/genome, that is ~30 TB/day, or ~10 PB/year
     Human Genome Project: ~10 GB
     1000 Genomes: 15 TB
     TCGA: 3 PB
     Source: NIH National Human Genome Research Institute
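The volume figures on the genome sequencing slide can be cross-checked with a few lines of arithmetic. This is only a sanity check, assuming decimal units (1 PB = 1000 TB = 1,000,000 GB) and a 365-day year; neither assumption is stated on the slide, and the variable names are illustrative.

```python
# Cross-check of the slide's data-volume figures.
# Assumptions (not from the slide): decimal units and a 365-day year.
GENOMES_PER_YEAR = 230_000         # from the slide (2014)
TB_PER_DAY = 30                    # from the slide
GB_PER_GENOME_RANGE = (15, 250)    # from the slide

# Daily rate scaled up to a year, in petabytes.
pb_per_year = TB_PER_DAY * 365 / 1000

# Average genome size implied by the daily rate and the genome count.
implied_gb_per_genome = TB_PER_DAY * 365 * 1000 / GENOMES_PER_YEAR

print(f"{pb_per_year:.2f} PB/year")
print(f"{implied_gb_per_genome:.1f} GB/genome implied average")
```

Under these assumptions the daily rate works out to about 11 PB/year, in line with the slide's ~10 PB/year, and the implied average genome size falls inside the quoted 15-250 GB range.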
  4. Sequencing advances line up well with scalable analytics software
     [Figure: chart annotated with Google MapReduce, Hadoop MR, Spark, and Parquet]
     Source: NIH National Human Genome Research Institute
  5. Mapping scientific systems to commodity analytics systems
     • Contemporary scientific systems are custom-built
     • As a result, functionality that commodity systems already provide gets rebuilt
     • We have an opportunity to rethink the abstractions that scientific systems use:
       • Migrate from a flat architecture to a stacked architecture
       • Expose higher-level programming primitives
       • Use commodity tools wherever possible
  6. Common Traits of Legacy Data-Intensive Scientific Systems
     1. Computation is workflow/pipeline oriented
     2. The processing system has a monolithic/flat architecture
     3. Data is stored in flat files
  7. Genomics Pipelines
     Source: The Broad Institute of MIT/Harvard
  8. Flat File Formats
     • Scientific data is typically stored in application-specific file formats:
       • Genomic reads: SAM/BAM, CRAM
       • Genomic variants: VCF/BCF, MAF
       • Genomic features: BED, NarrowPeak, GTF
     • Centralized metadata makes it difficult to parallelize applications
  9. Flat Architectures
     • APIs present very bare-bones abstractions:
       • GATK: Sorted iterator over the genome
     • Why are flat architectures bad?
       1. Trivial: low-level abstractions are not productive
       2. Trivial: flat architectures create technical lock-in
       3. Subtle: low-level abstractions can introduce bugs
  10. The perils of flattening…
      • The trivial:
        • You can improve performance by pushing data access order into your data layout
        • But now, you can’t easily compose pipeline stages that have different access orders
      • The obscure:
        • If you can only access data via a sorted iterator, will you implement your algorithm incorrectly?
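One way to picture the composition problem from slide 10 is a stage that needs mates side by side while the data is laid out in coordinate order. The sketch below is an illustration invented for this transcript, not an example from the talk: the `Read` type, both functions, and the sample reads are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str     # fragment name; the two mates of a pair share it
    contig: str
    start: int


def pair_mates_streaming(reads: Iterable[Read]) -> List[Tuple[Read, Read]]:
    """Pairs mates that are adjacent in the stream.

    Correct for name-ordered input, but silently drops pairs when the
    input arrives in coordinate order.
    """
    pairs = []
    prev = None
    for read in reads:
        if prev is not None and prev.name == read.name:
            pairs.append((prev, read))
            prev = None
        else:
            prev = read
    return pairs


def pair_mates_by_key(reads: Iterable[Read]) -> List[Tuple[Read, Read]]:
    """Order-independent version: regroup by the key this stage needs."""
    by_name = defaultdict(list)
    for read in reads:
        by_name[read.name].append(read)
    return [(group[0], group[1]) for group in by_name.values() if len(group) == 2]


reads = [
    Read("r1", "chr1", 100),
    Read("r2", "chr1", 150),
    Read("r1", "chr1", 300),
    Read("r2", "chr1", 400),
]
# The layout imposes coordinate order on every consumer.
position_sorted = sorted(reads, key=lambda r: (r.contig, r.start))

streaming_pairs = pair_mates_streaming(position_sorted)  # finds no pairs
keyed_pairs = pair_mates_by_key(position_sorted)         # finds both pairs
print(len(streaming_pairs), len(keyed_pairs))
```

The streaming version raises no error on coordinate-sorted input; it just returns fewer pairs, which is the kind of quiet bug a low-level iterator abstraction can invite.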
  11. A green field approach
  12. First, define a schema

      record AlignmentRecord {
        union { null, Contig } contig = null;
        union { null, long } start = null;
        union { null, long } end = null;
        union { null, int } mapq = null;
        union { null, string } readName = null;
        union { null, string } sequence = null;
        union { null, string } mateReference = null;
        union { null, long } mateAlignmentStart = null;
        union { null, string } cigar = null;
        union { null, string } qual = null;
        union { null, string } recordGroupName = null;
        union { int, null } basesTrimmedFromStart = 0;
        union { int, null } basesTrimmedFromEnd = 0;
        union { boolean, null } readPaired = false;
        union { boolean, null } properPair = false;
        union { boolean, null } readMapped = false;
        union { boolean, null } mateMapped = false;
        union { boolean, null } firstOfPair = false;
        union { boolean, null } secondOfPair = false;
        union { boolean, null } failedVendorQualityChecks = false;
        union { boolean, null } duplicateRead = false;
        union { boolean, null } readNegativeStrand = false;
        union { boolean, null } mateNegativeStrand = false;
        union { boolean, null } primaryAlignment = false;
        union { boolean, null } secondaryAlignment = false;
        union { boolean, null } supplementaryAlignment = false;
        union { null, string } mismatchingPositions = null;
        union { null, string } origQual = null;
        union { null, string } attributes = null;
        union { null, string } recordGroupSequencingCenter = null;
        union { null, string } recordGroupDescription = null;
        union { null, long } recordGroupRunDateEpoch = null;
        union { null, string } recordGroupFlowOrder = null;
        union { null, string } recordGroupKeySequence = null;
        union { null, string } recordGroupLibrary = null;
        union { null, int } recordGroupPredictedMedianInsertSize = null;
        union { null, string } recordGroupPlatform = null;
        union { null, string } recordGroupPlatformUnit = null;
        union { null, string } recordGroupSample = null;
        union { null, Contig } mateContig = null;
      }

      [Figure: stack diagram of seven layers: Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]
  13. A schema provides a narrow waist
      [Figure: the AlignmentRecord schema from slide 12, shown again beside the stack diagram: Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]
  14. Accelerate common access patterns
      • In genomics, we commonly have to find observations that overlap in a coordinate plane
      • This coordinate plane is genomics specific, and is known a priori
      • We can use our knowledge of the coordinate plane to implement a fast overlap join
      [Figure: stack diagram: Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]
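The overlap join from slide 14 can be sketched as a binned hash join: because the coordinate system (contig plus position) is known ahead of time, regions can be bucketed by fixed-width bins and only regions sharing a bin need to be compared. The talk does not spell out ADAM's algorithm at this point, so this is one common approach and not necessarily theirs; `Region`, `overlap_join`, the bin size, and the half-open interval convention are all assumptions made for the sketch.

```python
from collections import defaultdict
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    contig: str
    start: int   # inclusive
    end: int     # exclusive; the sketch assumes end > start


def overlaps(a: Region, b: Region) -> bool:
    return a.contig == b.contig and a.start < b.end and b.start < a.end


def bin_keys(region: Region, bin_size: int) -> List[Tuple[str, int]]:
    """All (contig, bin index) buckets that the region touches."""
    first = region.start // bin_size
    last = (region.end - 1) // bin_size
    return [(region.contig, i) for i in range(first, last + 1)]


def overlap_join(left, right, bin_size=1000):
    """Returns every (left, right) pair of overlapping regions."""
    index = defaultdict(list)
    for r in right:
        for key in bin_keys(r, bin_size):
            index[key].append(r)
    result = set()   # a pair that shares several bins is reported once
    for l in left:
        for key in bin_keys(l, bin_size):
            for r in index.get(key, ()):
                if overlaps(l, r):
                    result.add((l, r))
    return sorted(result)


reads = [Region("chr1", 100, 200), Region("chr1", 950, 1050), Region("chr2", 100, 200)]
features = [
    Region("chr1", 150, 160),
    Region("chr1", 1000, 2500),
    Region("chr1", 200, 300),
    Region("chr2", 0, 101),
]

joined = overlap_join(reads, features)
# Brute-force all-pairs comparison, used only to check the binned result.
naive = sorted((l, r) for l in reads for r in features if overlaps(l, r))
print(joined == naive, len(joined))
```

The binned version compares only regions that land in a common bucket, while the brute-force check compares every pair; the same bucketing key could also serve as a partitioning key in a distributed setting.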
  15. Pick appropriate storage
      • When accessing scientific datasets, we frequently slice and dice the dataset:
        • Algorithms may touch subsets of columns
        • We don’t always touch the whole dataset
      • This is a good match for columnar storage
      [Figure: stack diagram: Application (Transformations), Presentation (Enriched Models), Evidence Access (MapReduce/DBMS), Schema (Data Models), Materialized Data (Columnar Storage), Data Distribution (Parallel FS), Physical Storage (Attached Storage)]
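The argument on slide 15 for columnar storage can be illustrated with a toy in-memory layout. This is only a sketch of the idea: a real columnar format such as Parquet adds encoding, compression, and on-disk structure that are not modeled here, and the sample records and the `to_columns` helper are made up for illustration (the field names are taken from the AlignmentRecord schema).

```python
# Row-oriented layout: each record carries every field.
rows = [
    {"readName": "r1", "start": 100, "mapq": 60, "sequence": "ACGTACGT"},
    {"readName": "r2", "start": 150, "mapq": 30, "sequence": "TTGACCAA"},
    {"readName": "r3", "start": 900, "mapq": 0, "sequence": "GGGCATTA"},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per field."""
    return {name: [record[name] for record in records] for name in records[0]}


columns = to_columns(rows)

# A query that needs only mapq reads one column; the (much larger)
# sequence column is never touched.
mean_mapq = sum(columns["mapq"]) / len(columns["mapq"])

# Selecting a subset of records also works column by column:
# find matching positions first, then fetch only the fields required.
selected = [i for i, start in enumerate(columns["start"]) if start < 500]
selected_names = [columns["readName"][i] for i in selected]

print(mean_mapq, selected_names)
```

In the row layout, the same two queries would have to walk every record, including the sequence strings they never use.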
  16. Is introducing a new data model really a good idea?
      Source: XKCD, http://xkcd.com/927/
  17. A subtle point: Proper stack design can simplify backwards compatibility
      To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format.
      [Figure: two partial stacks that share the Schema (Data Models) and Data Distribution layers; in one the Materialized Data layer is a legacy file format, in the other it is columnar storage]
  18. A subtle point: Proper stack design can simplify backwards compatibility
      This is a view!
      [Figure: two partial stacks that share the Schema (Data Models) and Data Distribution layers; in one the Materialized Data layer is a legacy file format, in the other it is columnar storage]
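The "view" idea from slides 17 and 18 can be made concrete for one small part of a legacy format: the SAM/BAM FLAG field, which packs booleans into bits. The sketch below maps the boolean fields of the AlignmentRecord schema to those bits and back. The bit values follow the SAM specification; the mapping table and function names are assumptions made for this sketch, not ADAM's converter code, and the sketch ignores the SAM rule that mate-related bits are only meaningful for paired reads.

```python
# (schema field, SAM FLAG bit, inverted)
# inverted=True means the SAM bit encodes the negation of the field:
# SAM stores "unmapped" (0x4, 0x8), while the schema stores "mapped".
FLAG_BITS = [
    ("readPaired", 0x1, False),
    ("properPair", 0x2, False),
    ("readMapped", 0x4, True),
    ("mateMapped", 0x8, True),
    ("readNegativeStrand", 0x10, False),
    ("mateNegativeStrand", 0x20, False),
    ("firstOfPair", 0x40, False),
    ("secondOfPair", 0x80, False),
    ("secondaryAlignment", 0x100, False),
    ("failedVendorQualityChecks", 0x200, False),
    ("duplicateRead", 0x400, False),
    ("supplementaryAlignment", 0x800, False),
]


def to_sam_flag(record: dict) -> int:
    """Serialize the schema's boolean fields into a SAM FLAG integer.

    Missing fields default to False, matching the schema defaults.
    """
    flag = 0
    for field, bit, inverted in FLAG_BITS:
        if bool(record.get(field, False)) != inverted:
            flag |= bit
    return flag


def from_sam_flag(flag: int) -> dict:
    """Deserialize a SAM FLAG integer back into schema boolean fields."""
    return {field: bool(flag & bit) != inverted for field, bit, inverted in FLAG_BITS}


record = {
    "readPaired": True,
    "properPair": True,
    "readMapped": True,
    "mateMapped": True,
    "mateNegativeStrand": True,
    "firstOfPair": True,
}

flag = to_sam_flag(record)
round_trip = from_sam_flag(flag)
print(flag, round_trip["readMapped"], round_trip["mateNegativeStrand"])
```

Because the legacy encoding is derived from the schema and can be parsed back into it, tools written against the schema do not need to know which physical format the data came from.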
  19. A well designed stack simplifies application design
      • Application (Transformations): Variant calling & analysis, RNA-seq analysis, etc. Users define analyses via transformations.
      • Presentation (Enriched Models): Enriched Read/Variant. Enriched models provide convenient methods on common models.
      • Evidence Access (MapReduce/DBMS): Spark, Spark-SQL, Hadoop. The evidence access layer efficiently executes transformations.
      • Schema (Data Models): Avro Schema for reads, variants, and genotypes. Schemas define the logical structure of basic genomic objects.
      • Materialized Data (Columnar Storage): Load data from Parquet and legacy formats. Common interfaces map logical schema to bytes on disk.
      • Data Distribution (Parallel FS): HDFS, Tachyon, HPC file systems, S3. The parallel file system layer coordinates distribution of data.
      • Physical Storage (Attached Storage): Disk, SSD, block store, memory cache. Decoupling storage enables performance/cost tradeoff.
  20. How does this perform on real scientific data?
  21. ADAM performs genomic preprocessing
      Source: The Broad Institute of MIT/Harvard
  22. ADAM’s Performance
      • Achieves linear scalability out to 128 nodes for most tasks
      • Up to 3x improvement over current tools on a single node
      Analysis run on Amazon EC2; the single node was an i2.8xlarge, and the cluster used r3.2xlarge instances
      Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git
  23. Astronomy Pipelines
      Source: The LSST Project
  24. Astronomy Image Co-addition Performance
      • Scales out to 16 nodes
      • ~3x improvement over the existing tool on a single node
      Analysis run on Amazon EC2; the cluster used c3.8xlarge (HPC optimized) instances
  25. Conclusions
      • There is a huge increase in the amount of scientific data being processed
      • Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS-backed solutions
      • When we move to a general solution, we can gain performance without losing correctness
  26. Acknowledgements
      • ADAM (https://www.github.com/bigdatagenomics/adam):
        • UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
        • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
        • GenomeBridge: Carl Yeksigian
        • Cloudera: Uri Laserson
        • Microsoft Research: Ravi Pandya
        • UC Santa Cruz: Benedict Paten, David Haussler
      • KIRA (https://www.github.com/BIDS/Kira):
        • UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter
        • PoC code at https://github.com/zhaozhang/SparkMontage
