Rethinking Data-Intensive Science Using Scalable Analytics Systems

Presentation from SIGMOD 2015. With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson. Paper at http://dl.acm.org/citation.cfm?id=2742787.

  1. Rethinking Data-Intensive Science Using Scalable Analytics Systems
      Frank Austin Nothaft, UC Berkeley AMP/ASPIRE Lab, @fnothaft
      With Matt Massie, Timothy Danford, Zhao Zhang, Uri Laserson, Carl Yeksigian, Jey Kottalam, Arun Ahuja, Jeff Hammerbacher, Michael Linderman, Michael J. Franklin, Anthony D. Joseph, David A. Patterson
  2. Scientific revolutions are driven by data acquisition revolutions
  3. Genome Sequencing (Source: NIH National Genome Research Institute)
      • 2014: ~230,000 genomes sequenced
      • 15-250GB/genome = ~30TB/day = ~10PB/year
      • Human Genome Project: ~10GB
      • 1000 Genomes: 15TB
      • TCGA: 3PB
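
The figures on slide 3 can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 PB = 1,000 TB = 1,000,000 GB) and an even spread of sequencing over 365 days; neither assumption is stated on the slide, and the function names are illustrative only.

```python
# Back-of-the-envelope check of the data volumes quoted on slide 3.
GENOMES_PER_YEAR = 230_000        # from the slide (2014 estimate)
GB_PER_GENOME_RANGE = (15, 250)   # from the slide
PB_PER_YEAR = 10                  # from the slide (approximate)

GB_PER_TB = 1_000                 # assumption: decimal units
GB_PER_PB = 1_000_000             # assumption: decimal units


def implied_gb_per_genome(pb_per_year, genomes_per_year):
    """Average genome size implied by the yearly volume."""
    return pb_per_year * GB_PER_PB / genomes_per_year


def tb_per_day(pb_per_year, days=365):
    """Daily volume implied by the yearly volume, spread evenly."""
    return pb_per_year * GB_PER_PB / GB_PER_TB / days


avg_gb = implied_gb_per_genome(PB_PER_YEAR, GENOMES_PER_YEAR)
daily_tb = tb_per_day(PB_PER_YEAR)
print(f"implied average size: {avg_gb:.1f} GB/genome")
print(f"implied daily volume: {daily_tb:.1f} TB/day")
```

Under these assumptions the implied average is roughly 43 GB per genome, which sits inside the quoted 15-250GB range, and the daily volume comes out near 27 TB, consistent with the slide's rounded ~30TB/day.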
  4. Sequencing advances line up well with scalable analytics software (Source: NIH National Genome Research Institute)
      [Figure annotations: Google MapReduce, Hadoop MR, Spark, Parquet]
  5. Mapping scientific systems to commodity analytics systems
      • Contemporary scientific systems are custom-built
      • This leads to functionality from commodity systems being rebuilt
      • We have an opportunity to rethink the abstractions that scientific systems use:
        • Migrate from a flat architecture to a stacked architecture
        • Expose higher level programming primitives
        • Use commodity tools wherever possible
  6. Common Traits of Legacy Data-Intensive Scientific Systems
      1. Computation is workflow/pipeline oriented
      2. Processing system has monolithic/flat architecture
      3. Data is stored in flat files
  7. Genomics Pipelines (Source: The Broad Institute of MIT/Harvard)
  8. Flat File Formats
      • Scientific data is typically stored in application-specific file formats:
        • Genomic reads: SAM/BAM, CRAM
        • Genomic variants: VCF/BCF, MAF
        • Genomic features: BED, NarrowPeak, GTF
      • Centralized metadata makes it difficult to parallelize applications
  9. Flat Architectures
      • APIs present very barebones abstractions:
        • GATK: Sorted iterator over the genome
      • Why are flat architectures bad?
        1. Trivial: low level abstractions are not productive
        2. Trivial: flat architectures create technical lock-in
        3. Subtle: low level abstractions can introduce bugs
  10. The perils of flattening…
      • The trivial:
        • You can improve performance by pushing data access order into your data layout
        • But now, you can’t easily compose pipeline stages that have different access orders
      • The obscure:
        • If you access data via a sorted iterator, will you incorrectly implement your algorithm?
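
One hypothetical way the "obscure" peril on slide 10 could show up is sketched below; the slides do not give a concrete bug, so this example is an assumption made for illustration. Reads are half-open `(start, end)` intervals already sorted by start. The first function assumes that overlapping reads must be neighbors in sort order, which is tempting when the only API is a sorted iterator. The second keeps a set of still-open reads and finds every overlap.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def overlapping_pairs_adjacent_only(sorted_reads):
    """Buggy: only compares each read with its successor in sort order."""
    return [(a, b)
            for a, b in zip(sorted_reads, sorted_reads[1:])
            if overlaps(a, b)]


def overlapping_pairs_sweep(sorted_reads):
    """Correct: sweep over reads sorted by start, tracking open reads."""
    active = []   # earlier reads whose end lies past the current start
    pairs = []
    for read in sorted_reads:
        # Drop reads that ended at or before the current read's start.
        active = [r for r in active if r[1] > read[0]]
        pairs.extend((r, read) for r in active)
        active.append(read)
    return pairs


# A long read spans two short reads that do not touch each other.
reads = [(0, 100), (10, 20), (30, 40)]
print(overlapping_pairs_adjacent_only(reads))
print(overlapping_pairs_sweep(reads))
```

The adjacent-only version misses the overlap between `(0, 100)` and `(30, 40)` because a shorter read sits between them in sort order. A higher-level primitive such as an overlap join removes the need to hand-write this kind of iteration logic.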
  11. A green field approach
  12. First, define a schema
      record AlignmentRecord {
        union { null, Contig } contig = null;
        union { null, long } start = null;
        union { null, long } end = null;
        union { null, int } mapq = null;
        union { null, string } readName = null;
        union { null, string } sequence = null;
        union { null, string } mateReference = null;
        union { null, long } mateAlignmentStart = null;
        union { null, string } cigar = null;
        union { null, string } qual = null;
        union { null, string } recordGroupName = null;
        union { int, null } basesTrimmedFromStart = 0;
        union { int, null } basesTrimmedFromEnd = 0;
        union { boolean, null } readPaired = false;
        union { boolean, null } properPair = false;
        union { boolean, null } readMapped = false;
        union { boolean, null } mateMapped = false;
        union { boolean, null } firstOfPair = false;
        union { boolean, null } secondOfPair = false;
        union { boolean, null } failedVendorQualityChecks = false;
        union { boolean, null } duplicateRead = false;
        union { boolean, null } readNegativeStrand = false;
        union { boolean, null } mateNegativeStrand = false;
        union { boolean, null } primaryAlignment = false;
        union { boolean, null } secondaryAlignment = false;
        union { boolean, null } supplementaryAlignment = false;
        union { null, string } mismatchingPositions = null;
        union { null, string } origQual = null;
        union { null, string } attributes = null;
        union { null, string } recordGroupSequencingCenter = null;
        union { null, string } recordGroupDescription = null;
        union { null, long } recordGroupRunDateEpoch = null;
        union { null, string } recordGroupFlowOrder = null;
        union { null, string } recordGroupKeySequence = null;
        union { null, string } recordGroupLibrary = null;
        union { null, int } recordGroupPredictedMedianInsertSize = null;
        union { null, string } recordGroupPlatform = null;
        union { null, string } recordGroupPlatformUnit = null;
        union { null, string } recordGroupSample = null;
        union { null, Contig } mateContig = null;
      }
      [Stack diagram, layer (example): Application (Transformations); Presentation (Enriched Models); Evidence Access (MapReduce/DBMS); Schema (Data Models); Materialized Data (Columnar Storage); Data Distribution (Parallel FS); Physical Storage (Attached Storage)]
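
To make the shape of the slide 12 schema concrete, here is a minimal Python sketch that mirrors a handful of its fields. It is an illustration only, not ADAM's generated classes: the snake_case names are a Python convention, and `contig_name` is a simplifying assumption that stands in for the nested `Contig` record. Each `union { null, T } field = null` becomes an optional attribute defaulting to `None`, and each `union { boolean, null } field = false` becomes an optional attribute defaulting to `False`.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class AlignmentRecord:
    """Subset of the slide's AlignmentRecord; every field is nullable."""
    # union { null, T } field = null  ->  Optional[T] = None
    contig_name: Optional[str] = None   # assumption: replaces nested Contig
    start: Optional[int] = None
    end: Optional[int] = None
    mapq: Optional[int] = None
    read_name: Optional[str] = None
    sequence: Optional[str] = None
    cigar: Optional[str] = None
    # union { boolean, null } field = false  ->  Optional[bool] = False
    read_mapped: Optional[bool] = False
    read_negative_strand: Optional[bool] = False


# A producer only fills in what it knows; everything else keeps its default.
record = AlignmentRecord(contig_name="chr1", start=100, end=150,
                         read_name="r1", read_mapped=True)

# Fields that were never set stay null and can be skipped by consumers.
populated = {k: v for k, v in asdict(record).items() if v is not None}
print(populated)
```

Because every field has a default, a stage that only needs `start` and `end` can ignore the rest, which is part of what lets the schema act as a shared contract between layers.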
  13. A schema provides a narrow waist
      [Repeats the AlignmentRecord schema and the stack diagram from slide 12]
  14. Accelerate common access patterns
      • In genomics, we commonly have to find observations that overlap in a coordinate plane
      • This coordinate plane is genomics specific, and is known a priori
      • We can use our knowledge of the coordinate plane to implement a fast overlap join
      [Stack diagram as on slide 12]
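
A minimal sketch of an overlap join that exploits a known coordinate system is shown below. It uses fixed-size binning, a generic technique chosen here for illustration; the slides do not describe how ADAM implements its join, and `BIN_SIZE`, `bins_for`, and `overlap_join` are hypothetical names. Regions are `(contig, start, end)` tuples with half-open, non-empty intervals.

```python
from collections import defaultdict

BIN_SIZE = 1000  # hypothetical bin width in bases


def bins_for(start, end, bin_size=BIN_SIZE):
    """Indices of the bins touched by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def overlap_join(left, right, bin_size=BIN_SIZE):
    """Return sorted (left, right) pairs on the same contig that overlap."""
    # Index the right side by (contig, bin).
    index = defaultdict(list)
    for region in right:
        contig, start, end = region
        for b in bins_for(start, end, bin_size):
            index[(contig, b)].append(region)

    # Probe only the bins each left region touches.
    result = set()  # a set removes pairs found in more than one bin
    for a in left:
        contig, start, end = a
        for b in bins_for(start, end, bin_size):
            for c in index.get((contig, b), ()):
                if a[1] < c[2] and c[1] < a[2]:
                    result.add((a, c))
    return sorted(result)


reads = [("chr1", 100, 200), ("chr1", 950, 1050), ("chr2", 100, 200)]
features = [("chr1", 150, 1000), ("chr1", 1040, 2000), ("chr2", 500, 600)]
for pair in overlap_join(reads, features):
    print(pair)
```

Each region is compared only against regions that share a bin on the same contig, instead of against the whole other dataset. In a distributed setting the `(contig, bin)` key could also serve as a partitioning key, though that is beyond this sketch.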
  15. Pick appropriate storage
      • When accessing scientific datasets, we frequently slice and dice the dataset:
        • Algorithms may touch subsets of columns
        • We don’t always touch the whole dataset
      • This is a good match for columnar storage
      [Stack diagram as on slide 12]
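
The access pattern on slide 15 can be illustrated with a toy in-memory column layout. This is a sketch of the idea only and does not use Parquet or any real storage library; `to_columns` and `select` are hypothetical helpers and the sample reads are made up for the example.

```python
rows = [
    {"readName": "r1", "start": 100, "mapq": 60, "sequence": "ACGTACGT"},
    {"readName": "r2", "start": 250, "mapq": 10, "sequence": "TTGACCAA"},
    {"readName": "r3", "start": 400, "mapq": 45, "sequence": "GGCATTAC"},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per column."""
    return {key: [row[key] for row in rows] for key in rows[0]}


def select(columns, wanted, predicate_column, predicate):
    """Filter on one column, then read only the requested columns."""
    keep = [i for i, value in enumerate(columns[predicate_column])
            if predicate(value)]
    return {name: [columns[name][i] for i in keep] for name in wanted}


columns = to_columns(rows)

# Only 'mapq', 'readName', and 'start' are touched; the bulky 'sequence'
# column is never read.
result = select(columns, ["readName", "start"], "mapq", lambda q: q >= 30)
print(result)
```

With a row layout, answering the same query would mean reading every field of every record, including the sequence strings that the query never uses.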
  16. Is introducing a new data model really a good idea? (Source: XKCD, http://xkcd.com/927/)
  17. A subtle point: Proper stack design can simplify backwards compatibility
      To support legacy data formats, you define a way to serialize/deserialize the schema into/from the legacy flat file format.
      [Diagram: two stacks sharing the Data Distribution and Schema (Data Models) layers; the Materialized Data layer is a Legacy File Format in one and Columnar Storage in the other]
  18. A subtle point: Proper stack design can simplify backwards compatibility
      This is a view!
      [Same diagram as slide 17]
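
The serialize/deserialize idea from slide 17 can be sketched for a single SAM alignment line. This is a simplified illustration, not ADAM's converter: it maps only some of the 11 mandatory SAM columns onto schema-style names, keeps the raw `flag` integer rather than decoding it into the schema's booleans, and carries the remaining columns through untouched so the round trip is lossless. The key `contigName` and the 0-based `start` are assumptions made for the example.

```python
def from_sam_line(line):
    """Parse one SAM alignment line into a schema-style dict."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 11:
        raise ValueError("expected at least 11 tab-separated SAM columns")
    qname, flag, rname, pos, mapq, cigar = cols[:6]
    return {
        "readName": qname,
        "flag": int(flag),                                # kept raw (simplification)
        "contigName": None if rname == "*" else rname,
        "start": None if pos == "0" else int(pos) - 1,    # SAM POS is 1-based
        "mapq": int(mapq),
        "cigar": None if cigar == "*" else cigar,
        "sequence": None if cols[9] == "*" else cols[9],
        "qual": None if cols[10] == "*" else cols[10],
        "passthrough": cols[6:9],                         # RNEXT, PNEXT, TLEN
        "optional": cols[11:],                            # optional tag fields
    }


def to_sam_line(rec):
    """Serialize the dict produced by from_sam_line back to a SAM line."""
    cols = [
        rec["readName"],
        str(rec["flag"]),
        rec["contigName"] or "*",
        "0" if rec["start"] is None else str(rec["start"] + 1),
        str(rec["mapq"]),
        rec["cigar"] or "*",
        *rec["passthrough"],
        rec["sequence"] or "*",
        rec["qual"] or "*",
        *rec["optional"],
    ]
    return "\t".join(cols)


line = "r1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0"
record = from_sam_line(line)
print(record["contigName"], record["start"])
print(to_sam_line(record) == line)
```

Layers above the schema only ever see the dict, so the same analysis code can run whether the bytes came from a legacy file or from columnar storage.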
  19. A well designed stack simplifies application design
      • Application (Transformations), e.g. variant calling & analysis, RNA-seq analysis, etc.: users define analyses via transformations
      • Presentation (Enriched Models), e.g. Enriched Read/Variant: enriched models provide convenient methods on common models
      • Evidence Access (MapReduce/DBMS), e.g. Spark, Spark-SQL, Hadoop: the evidence access layer efficiently executes transformations
      • Schema (Data Models), e.g. Avro Schema for reads, variants, and genotypes: schemas define the logical structure of basic genomic objects
      • Materialized Data (Columnar Storage), e.g. load data from Parquet and legacy formats: common interfaces map logical schema to bytes on disk
      • Data Distribution (Parallel FS), e.g. HDFS, Tachyon, HPC file systems, S3: parallel file system layer coordinates distribution of data
      • Physical Storage (Attached Storage), e.g. disk, SSD, block store, memory cache: decoupling storage enables performance/cost tradeoff
  20. How does this perform on real scientific data?
  21. ADAM performs genomic preprocessing (Source: The Broad Institute of MIT/Harvard)
  22. ADAM’s Performance
      • Achieve linear scalability out to 128 nodes for most tasks
      • Up to 3x improvement over current tools on a single node
      Analysis run using Amazon EC2; single node was i2.8xlarge, cluster was r3.2xlarge
      Scripts available at https://www.github.com/bigdatagenomics/bdg-services.git
  23. Astronomy Pipelines (Source: The LSST Project)
  24. Astronomy Image Co-addition Performance
      • Scales out to 16 nodes
      • ~3x improvement over extant tool on a single node
      Analysis run using Amazon EC2; cluster was c3.8xlarge (HPC optimized)
  25. Conclusions
      • There is a huge increase in the amount of scientific data being processed
      • Although scientific processing pipelines tend to be custom solutions, we can replace these pipelines with general, DBMS-backed solutions
      • When we move to a general solution, we can gain performance without losing correctness
  26. Acknowledgements
      • ADAM (https://www.github.com/bigdatagenomics/adam):
        • UC Berkeley: Matt Massie, Timothy Danford, André Schumacher, Jey Kottalam, Karen Feng, Eric Tu, Niranjan Kumar, Ananth Pallaseni, Anthony Joseph, Dave Patterson
        • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Ryan Williams, Michael Linderman, Jeff Hammerbacher
        • GenomeBridge: Carl Yeksigian
        • Cloudera: Uri Laserson
        • Microsoft Research: Ravi Pandya
        • UC Santa Cruz: Benedict Paten, David Haussler
      • KIRA (https://www.github.com/BIDS/Kira):
        • UC Berkeley: Zhao Zhang, Mike Franklin, Evan Sparks, Kyle Barbary, Oliver Zahn, Saul Perlmutter
        • PoC code at https://github.com/zhaozhang/SparkMontage
