ADAM: Fast, Scalable Genome Analysis
Frank Austin Nothaft
AMPLab, University of California, Berkeley, @fnothaft

Adam bosc-071114

Published in: Technology, Education

  1. ADAM: Fast, Scalable Genome Analysis
     Frank Austin Nothaft, AMPLab, University of California, Berkeley, @fnothaft
     with: Matt Massie, André Schumacher, Timothy Danford, Carl Yeksigian, Chris Hartl, Jey Kottalam, Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher, Anthony Joseph, and Dave Patterson
     https://github.com/bigdatagenomics
     http://www.bdgenomics.org
  2. What is in ADAM/BDG?
     • ADAM: Core API + CLIs
     • bdg-formats: Data schemas
     • RNAdam: RNA analysis on ADAM
     • avocado: Distributed local assembler
     • Guacamole: Distributed somatic caller
     • xASSEMBLEx: GraphX-based de novo assembler
     • bdg-services: ADAM clusters
  3. Design Goals
     • Develop a processing pipeline that enables efficient, scalable use of cluster/cloud resources
     • Provide a data format that has efficient parallel/distributed access across platforms
     • Enhance the semantics of the data and allow more flexible data access patterns
  4. Implementation Overview
     • 27K lines of Scala code
     • 100% Apache-licensed open source
     • 21 contributors from 8 institutions
     • Working towards a production-quality release in late 2014
  5. ADAM Stack
     • Physical
       ‣ Commodity Hardware
       ‣ Cloud Systems - Amazon, GCE, Azure
     • File/Block
       ‣ Hadoop Distributed Filesystem
       ‣ Local Filesystem
     • Record/Split
       ‣ Schema-driven records w/ Apache Avro
       ‣ Store and retrieve records using Parquet
       ‣ Read BAM Files using Hadoop-BAM
     • In-Memory RDD
       ‣ Transform records using Apache Spark
       ‣ Query with SQL using Shark
       ‣ Graph processing with GraphX
       ‣ Machine learning using MLBase
  6. Design Principles
     • Abstract as much as possible: schema-oriented design makes the format easy to evolve
     • Provide rich and scalable APIs for manipulating and transforming genomic data and regions
     • Don’t lock data in: play nicely with other tools
  7. Parquet
     • Open-source software created by Twitter and Cloudera, based on Google Dremel; just entered the Apache Incubator
     • Columnar file format:
       • Limits I/O to only the data that is needed
       • Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data
       • Fast scans - load only the columns you need, e.g. scan a read flag on a whole-genome, high-coverage file in less than a minute
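The "load only the columns you need" point on slide 7 can be illustrated with a toy sketch. This is not the Parquet or ADAM API: plain Python lists stand in for column chunks, and the record fields (`name`, `flags`, `sequence`, `qualities`) and both helper functions are hypothetical names chosen for illustration. The only external fact relied on is that bit 0x400 of the SAM flag marks a duplicate read.

```python
# Toy illustration only: plain Python lists stand in for Parquet column chunks.
DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"

# Row-oriented layout: a scan has to step over every field of every read.
rows = [
    {"name": "r1", "flags": 0x0, "sequence": "ACGT", "qualities": "IIII"},
    {"name": "r2", "flags": 0x400, "sequence": "TTGA", "qualities": "IIHH"},
    {"name": "r3", "flags": 0x410, "sequence": "GGCA", "qualities": "HHII"},
]


def to_columns(records):
    """Pivot row records into one list per field (a columnar layout)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def count_duplicates(columns):
    """Scan only the 'flags' column; sequence and quality data are never touched."""
    return sum(1 for flags in columns["flags"] if flags & DUPLICATE_FLAG)


columns = to_columns(rows)
duplicate_count = count_duplicates(columns)
print(duplicate_count)
```

In a real columnar file the untouched columns are also never read from disk, which is where the I/O saving described on the slide comes from; the sketch only shows the access pattern.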
  8. Scaling Genomics: BQSR (base quality score recalibration)
     • Broadcast 3 GB table of variants, used for masking
     • Break reads down to bases and map bases to covariates
     • Calculate empirical values per covariate
     • Broadcast observation, apply across reads
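The four steps on slide 8 can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not ADAM's or GATK's implementation: a Python set stands in for the broadcast variant table, the covariate is assumed to be just (reported quality, cycle), the +1/+2 smoothing is an assumption, and all function and field names are hypothetical.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality."""
    return -10.0 * math.log10(error_rate)


def observe(reads, reference, known_variant_sites):
    """Mask known sites, break reads into bases, tally mismatches per covariate."""
    counts = defaultdict(lambda: [0, 0])  # covariate -> [observations, mismatches]
    for read in reads:
        for cycle, (base, qual) in enumerate(zip(read["bases"], read["quals"])):
            position = read["start"] + cycle
            if position in known_variant_sites:
                continue  # a known variant is not evidence of a sequencing error
            covariate = (qual, cycle)  # assumed covariate: reported quality + cycle
            counts[covariate][0] += 1
            if base != reference[position]:
                counts[covariate][1] += 1
    return counts


def empirical_qualities(counts):
    """Compute one empirical quality per covariate (assumed +1/+2 smoothing)."""
    return {
        covariate: round(phred((mismatches + 1) / (observations + 2)))
        for covariate, (observations, mismatches) in counts.items()
    }


def recalibrate(read, table):
    """Apply the table to a read; covariates never observed keep their quality."""
    new_quals = [table.get((q, c), q) for c, q in enumerate(read["quals"])]
    return {**read, "quals": new_quals}


reference = "ACGTACGT"
known_variant_sites = {5}  # stand-in for the broadcast variant table
reads = [
    {"start": 0, "bases": "ACGT", "quals": [30, 30, 30, 30]},
    {"start": 2, "bases": "GTTC", "quals": [30, 30, 30, 30]},
]

counts = observe(reads, reference, known_variant_sites)
table = empirical_qualities(counts)
recalibrated = [recalibrate(read, table) for read in reads]
```

In a distributed setting the tally in `observe` would be an aggregation over partitions of reads, and both the variant mask and the finished table would be shipped to every worker, which matches the two broadcast steps named on the slide.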
  9. Performance/Accuracy
     • Fully concordant with Picard for MarkDup, >99% concordant with GATK for BQSR
     [Figure: plot with ADAM and GATK axes, each from 0 to 50]
     [Figure: runtime in hours (0 to 24) for Sort, Mark Duplicates, and BQSR; Picard vs. ADAM on 100 EC2 nodes]
  10. Future Work
     • Pushing hard towards a production release
     • Building out a complete analysis pipeline
     • Plan to release Python bindings
     • Work on interoperability with the Global Alliance for Genomics and Health API (http://genomicsandhealth.org/)
  11. Call for contributions
     • As an open source project, we welcome contributions
     • We maintain a list of open enhancements at our GitHub issue trackers
     • GitHub: https://www.github.com/bdgenomics
     • UC Berkeley is looking to hire two full-time engineers to support this work
  12. Acknowledgements
     • UC Berkeley: Matt Massie, André Schumacher, Jey Kottalam, Christos Kozanitis
     • Mt. Sinai: Arun Ahuja, Neal Sidhwaney, Michael Linderman, Jeff Hammerbacher
     • GenomeBridge: Timothy Danford, Carl Yeksigian
     • The Broad Institute: Chris Hartl
     • Cloudera: Uri Laserson
     • Microsoft Research: Jeremy Elson, Ravi Pandya
     • Michael Heuer
     • And other open source contributors!
  13. Acknowledgements
     This research is supported in part by NSF CISE Expeditions Award CCF-1139158, LBNL Award 7076018, and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, The Thomas and Stacey Siebel Foundation, Apple, Inc., C3Energy, Cisco, Cloudera, EMC, Ericsson, Facebook, GameOnTalis, Guavus, HP, Huawei, Intel, Microsoft, NetApp, Pivotal, Splunk, Virdata, VMware, WANdisco and Yahoo!.
