Strata Big Data Science Talk on ADAM
Upcoming SlideShare
Loading in...5
×
 

Strata Big Data Science Talk on ADAM

on

  • 4,356 views

Frank Nothaft's talk at the Strata Big Data Science meetup

Frank Nothaft's talk at the Strata Big Data Science meetup

Statistics

Views

Total Views
4,356
Views on SlideShare
470
Embed Views
3,886

Actions

Likes
3
Downloads
20
Comments
0

12 Embeds 3,886

http://bdgenomics.org 3646
http://bigdatagenomics.github.io 112
http://localhost 99
http://adam.cs.berkeley.edu 7
http://www.linkedin.com 5
http://www.google.co.nz 4
http://webcache.googleusercontent.com 4
http://www.google.com 3
http://plus.url.google.com 3
http://feedly.com 1
http://www.gizoogle.net 1
http://www.google.com.au 1
More...

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Strata Big Data Science Talk on ADAM Strata Big Data Science Talk on ADAM Presentation Transcript

  • ADAM: Fast, Scalable Genome Analysis Frank Austin Nothaft AMPLab, University of California, Berkeley with: Matt Massie, André Schumacher, Timothy Danford, Chris Hartl, Jey Kottalam, Arun Aruha, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson https://github.com/bigdatagenomics Wednesday, February 12, 14
  • Problem • Whole genome files are large • Biological systems are complex • Population analysis requires petabytes of data • Analysis time is often a matter of life and death Wednesday, February 12, 14
  • Shredded Book Analogy Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, … Slide credit to Michael Schatz http://schatzlab.cshl.edu/ Wednesday, February 12, 14
  • Shredded Book Analogy Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools It was the best of It was the best It was the It was It times, it was the worst of times, it was the best of times, it was the best of times, it was the best of times, of times, it was the worst of times, it was the worst of times, it was the worst of times, it was the worst of age of wisdom, it was the age of wisdom, it was the age of wisdom, it was the age of times, it was the age the age of foolishness, … was the age of foolishness, it was the age of wisdom, it was the age of wisdom, it was the foolishness, … of foolishness, … age of foolishness, … Slide credit to Michael Schatz http://schatzlab.cshl.edu/ Wednesday, February 12, 14
  • Shredded Book Analogy Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools It was the best of It was the best It was the It was It times, it was the worst of times, it was the best of times, it was the best of times, it was the best of times, of times, it was the worst of times, it was the worst of times, it was the worst of times, it was the worst of age of wisdom, it was the age of wisdom, it was the age of wisdom, it was the age of times, it was the age the age of foolishness, … was the age of foolishness, it was the age of wisdom, it was the age of wisdom, it was the foolishness, … of foolishness, … age of foolishness, … • How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical Slide credit to Michael Schatz http://schatzlab.cshl.edu/ Wednesday, February 12, 14
  • Shredded Book Analogy Dickens accidentally shreds the first printing of A Tale of Two Cities Text printed on 5 long spools It was the best of It was the best It was the It was It times, it was the worst of times, it was the best of times, it was the best of times, it was the best of times, of times, it was the worst of times, it was the worst of times, it was the worst of times, it was the worst of age of wisdom, it was the age of wisdom, it was the age of wisdom, it was the age of times, it was the age the age of foolishness, … was the age of foolishness, it was the age of wisdom, it was the age of wisdom, it was the foolishness, … of foolishness, … age of foolishness, … • How can he reconstruct the text? – 5 copies x 138, 656 words / 5 words per fragment = 138k fragments – The short fragments from every copy are mixed together – Some fragments are identical Slide credit to Michael Schatz http://schatzlab.cshl.edu/ Wednesday, February 12, 14
  • What is ADAM? • File formats: columnar file format that allows efficient parallel access to genomes • API: interface for transforming, analyzing, and querying genomic data • CLI: a handy toolkit for quickly processing genomes Wednesday, February 12, 14
  • SNAP ADAM SiRen CAGE Avocado MLBase GraphX The Broad Institute “Best Practices” Pipeline Wednesday, February 12, 14
  • Design Goals • Develop processing pipeline that enables efficient, scalable use of cluster/cloud • Provide data format that has efficient parallel/distributed access across platforms • Enhance semantics of data and allow more flexible data access patterns Wednesday, February 12, 14
  • Whole Genome Data Sizes Input Pipeline Stage Output SNAP 1GB Fasta Alignment 150GB Fastq 250GB BAM ADAM 250GB BAM Preprocessing 200GB ADAM Avocado 200GB ADAM 10MB ADAM Variant Calling Variants found at about 1 in 1,000 loci Wednesday, February 12, 14
  • ADAM Stack RDD ‣Transform records using Apache Spark ‣Query with SQL using Shark ‣Graph processing with GraphX ‣Machine learning using MLBase ‣Schema-driven records w/ Apache Avro Record/Split ‣Store and retrieve records using Parquet ‣Read BAM Files using Hadoop-BAM File/Block Physical Wednesday, February 12, 14 ‣Hadoop Distributed Filesystem ‣Local Filesystem ‣Commodity Hardware ‣Cloud Systems - Amazon, GCE, Azure
  • Commits Implementation • • • • Work accelerated at end of September • Proof of concept implementations at The Broad Institute, Duke, Harvard, UCSC • Global Alliance looking at ADAM as potential standard Wednesday, February 12, 14 15K lines of Scala code 100% Apache-licensed open-source Code contributions from Mt. Sinai, GenomeBridge, The Broad Institute
  • AWS EC2 Performance NA12878 Whole Genome 234GB as BAM, 229 GB as ADAM Sort Mark Duplicates 1064 1222 Picard 536 ADAM 1 node 32 ADAM 32 nodes 20 28 1 10 53x 100 Minutes (log scale) Wednesday, February 12, 14 33x 70 ADAM 100 nodes 2x 1000 10000
  • Mt. Sinai Performance NA12878 Whole Genome 234GB as BAM, 229 GB as ADAM Sort Mark Duplicates 1064 1222 Picard 29 50 17 34 11 15 ADAM 16 nodes ADAM 32 nodes ADAM 64 nodes 8 ADAM 82 nodes 1 10 62x 96x 133x 14 100 Minutes (log scale) Wednesday, February 12, 14 36x 1000 10000
  • Scalability HG00096 à 16GB NA12878 à 234GB Wednesday, February 12, 14
  • Acknowledgements • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher • GenomeBridge: Timothy Danford, Carl Yeksigian Wednesday, February 12, 14
  • Acknowledgements This research is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Apple, Inc., Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, GameOnTalis, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Samsung, Splunk,VMware, WANdisco and Yahoo!. Wednesday, February 12, 14
  • “Before seeing your data, I would not think sorting could be done that fast.” - research scientist at the Broad Institute Wednesday, February 12, 14
  • Call for contributions • As an open source project, we welcome contributions • We maintain a list of open enhancements at our Github issue tracker • Enhancements tagged with “Pick me up!” don’t require a genomics background • Github: https://github.com/bigdatagenomics/ Wednesday, February 12, 14