Optimizing Large Genome Assembly in the Cloud
Apurva Kumar, Soumyarupa De and Kenneth Yocum
University of California, San Diego
Genomic Analytics

Data
• High-throughput sequencing of a human genome (African male NA18507) produces 3.5 billion 36-base-pair (bp) reads, about 756 MB of data (cost: roughly USD 1,000).

Current problem: assemble the genome at lower cost and in minimal time.

Why this problem?
• Only 60 complete human genomes are publicly available (including Steve Jobs's, which he had sequenced for USD 100K).
• Commercial sequencing is still at an early stage and growing rapidly.

Question: How do we scale to this 3.5 B bp of input data and compute the final genome sequence?
Optimizing using Stateful Bulk Processing
We use stateful Continuous Bulk Processing (CBP), an efficient stateful graph-processing model (expressive enough to support even Google's "Pregel"), and run it on top of Azure.
[Figure: Contrail implemented as CBP stages. Each stage is a translate function T(·, ΔF_state, ΔF_1) that consumes keyed deltas (Δin) from its input flows (F_in edges, F_in state) and emits deltas (Δout) on its output flows (F_out edges, F_out state). Caption: translate function for a stage with state as a loopback flow.]
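To make the translate-function picture concrete, here is a minimal Python sketch; it is not the actual CBP or Azure API, and the names (translate_edges, run_stage) are illustrative. Each per-key invocation receives the prior state records from the loopback flow plus the newly arrived input delta, and emits an updated state delta together with an output delta.

# Minimal sketch of a CBP-style stage with state as a loopback flow.
# All names are illustrative; this is not the actual CBP or Azure API.

def translate_edges(key, delta_state, delta_edges):
    """T(key, ΔF_state, ΔF_1): merge newly arrived edges for this key (a graph
    node) into the edge set remembered on the loopback state flow."""
    state = set(delta_state)              # prior state records for this key
    new_edges = set(delta_edges) - state  # only the genuinely new edges
    state |= new_edges
    # Emit (updated state for the loopback flow, new edges downstream).
    return sorted(state), sorted(new_edges)

def run_stage(translate, state_flow, input_flow):
    """Group both flows by key and invoke the translate function per key."""
    keys = set(state_flow) | set(input_flow)
    out_state, out_edges = {}, {}
    for k in keys:
        s, e = translate(k, state_flow.get(k, []), input_flow.get(k, []))
        out_state[k], out_edges[k] = s, e
    return out_state, out_edges

# Usage: the first run creates state; the second run only sees the deltas.
state = {}
state, out = run_stage(translate_edges, state, {"ACG": ["CGT"], "CGT": ["GTA"]})
state, out = run_stage(translate_edges, state, {"ACG": ["CGA"]})
print(state["ACG"], out["ACG"])   # ['CGA', 'CGT'] ['CGA']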
Our Approach
Aim: assemble a human genome (3.5 B reads of 36 bp).
Pipeline: genomic analysis → graph processing → CBP on Azure (+ Newt for debugging).
Genome assembly using graphs
• Build a de Bruijn graph from the sample of short input reads.
• An Eulerian walk across the graph gives the genomic sequence (see the toy sketch below).
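The following toy Python sketch shows the idea; it is deliberately simplified, and a real assembler such as Contrail must also handle sequencing errors, reverse complements, branching, and coverage.

# Toy de Bruijn assembly sketch: real assemblers also handle sequencing
# errors, reverse complements, branching, and coverage.
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds a prefix->suffix edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedy Eulerian-style walk: consume one outgoing edge at a time."""
    seq, node = start, start
    while graph[node]:
        node = graph[node].pop()
        seq += node[-1]          # each step contributes one new base
    return seq

reads = ["ACGTA", "CGTAC", "GTACG"]   # overlapping 5-bp toy "reads"
g = de_bruijn(reads, k=4)
print(walk(g, "ACG"))                 # reconstructs a longer contig: ACGTACG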
Errors in Contrail Pipeline
• Contrail is a multi-stage pipeline with several MapReduce stages.
• Types of errors: fail-stop crashes, corrupted outputs, and suspicious actors.
Debugging via Newt
• Newt's "fail" API is triggered on a crash.
  - It reports the crash culprits to Newt.
• Backward tracing: find the inputs that lead to corrupted outputs or fail-stops (sketched after this list).
  - Prune the selected inputs and replay the entire pipeline.
• Crash avoidance: remove crash culprits immediately and continue.
  - No replay overhead; transparent fault handling.
  - Open question: how to handle removed inputs that are used on other dataflow paths?
• Online identification of suspicious actor behavior based on the actor's history.
  - Processing rate (too slow, too fast), selectivity (n-to-1 instead of 1-to-1).
  - Open question: how much history to keep?
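The backward-tracing step can be viewed as a search over the captured lineage records, from a bad output back to the raw inputs that produced it. The following Python sketch is illustrative only; the lineage layout and names are assumptions, not Newt's actual API or schema.

# Illustrative sketch of backward tracing over per-stage lineage records.
# This is not Newt's actual API or storage schema.

# lineage[stage] maps an output record id -> the input record ids it came from.
lineage = {
    "build_graph":  {"edge7": ["read3", "read9"], "edge8": ["read4"]},
    "refine_graph": {"contig2": ["edge7", "edge8"]},
}
stage_order = ["build_graph", "refine_graph"]   # upstream -> downstream

def backward_trace(bad_outputs):
    """Walk the lineage from the last stage back to the pipeline's raw inputs."""
    frontier = set(bad_outputs)
    for stage in reversed(stage_order):
        mapping = lineage[stage]
        # Records with no recorded lineage pass through unchanged.
        frontier = {src for out in frontier for src in mapping.get(out, [out])}
    return frontier            # culprit raw inputs

culprits = backward_trace({"contig2"})
print(culprits)                # {'read3', 'read9', 'read4'} (in some order)

def replay(all_inputs, culprits):
    """Prune the culprit inputs and re-run the pipeline on what remains."""
    return [r for r in all_inputs if r not in culprits]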
Contrail with Newt
[Figure: Contrail pipeline: short-read files → Build Graph → Graph refinement → Genome sequence. The legend marks stages whose porting to CBP is completed versus pending.]

Current Assembly Techniques
Assembler     Technology   CPU/RAM
Velvet        Serial       2 TB RAM
ABySS         MPI          168 cores x 96 hours
SOAPdenovo    Pthreads     40 cores x 40 hours, > 140 GB RAM (total)
Bulk Processing with State
[Figure: An incremental CBP dataflow (web-crawl example). Input data D flows through Extract links → Count in-links (site/URL frequency) → Merge w/ seen → Score and threshold to produce output A; the stateful stages keep state, so a later change ΔD flows through the same stages against that state and yields only the output change ΔA.]
1.) Process dataflow σ with input D.
2.) Create state.
3.) Process the changes, ΔD, together with the prior state.
CBP updates rather than reprocessing, which saves CPU, disk, network, and energy.
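As a concrete analogue of steps 1.)-3.), here is a small Python sketch of the in-link counting stage from the figure; the function and variable names are hypothetical, not the CBP runtime. The first call processes the full input D and creates state; a later call processes only the change ΔD against that state instead of recomputing from scratch.

# Sketch of the incremental "count in-links" stage: hypothetical names,
# not the actual CBP runtime.
from collections import Counter

def count_inlinks(state, delta_links):
    """state: Counter of URL -> in-link count; delta_links: new (src, dst) pairs.
    Returns (updated state, the delta of counts that changed)."""
    delta_out = Counter(dst for _, dst in delta_links)
    state.update(delta_out)                    # merge with what we've seen
    return state, delta_out

# 1.) Process the full input D and 2.) create state.
state = Counter()
D = [("a.com", "x.com"), ("b.com", "x.com"), ("c.com", "y.com")]
state, _ = count_inlinks(state, D)

# 3.) Later, process only the change ΔD against the prior state.
delta_D = [("d.com", "x.com")]
state, delta_A = count_inlinks(state, delta_D)
print(state["x.com"], dict(delta_A))   # 3 {'x.com': 1}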
Newt Lineage
• Newt is a provenance manager.
• It captures fine-grained provenance in MapReduce jobs.
• It traces data provenance through the multi-stage pipeline.
• It replays actors with selected inputs.
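One way to picture fine-grained provenance capture is a wrapper around a map-style actor that records which input record produced which output records. The sketch below is illustrative only and does not reflect Newt's real instrumentation or record format.

# Illustrative sketch of fine-grained provenance capture around a map-style
# actor; not Newt's real instrumentation or record format.

provenance = []   # (stage, input_id, output_id) triples

def traced_map(stage, map_fn, records):
    """Apply map_fn to each (id, value) record and log input->output lineage."""
    outputs = []
    for in_id, value in records:
        for j, out_value in enumerate(map_fn(value)):
            out_id = f"{in_id}/{j}"
            provenance.append((stage, in_id, out_id))
            outputs.append((out_id, out_value))
    return outputs

def split_into_kmers(read, k=4):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

out = traced_map("build_graph", split_into_kmers, [("read1", "ACGTA")])
print(out)          # [('read1/0', 'ACGT'), ('read1/1', 'CGTA')]
print(provenance)   # [('build_graph', 'read1', 'read1/0'), ('build_graph', 'read1', 'read1/1')]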
Contrail implemented as CBP Stages
• Contrail (Schatz et al. 2009) uses Hadoop: it builds a large graph (>3 B nodes and >10 B edges) and iterates. It scales, but it is inefficient, slow (a 1.5 MB input takes 2 hours to assemble), and complex (40+ MR stages).
• The Build Graph stage is stateful (it saves src, dst information); the 10 Contrail MR stages in this phase have been mapped to CBP.
• The Graph refinement stage is stateful and iterative (it refines the graph); the 30 Contrail MR stages in this phase remain to be mapped.
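To illustrate what a stateful Build Graph stage could look like in CBP terms, here is a hedged Python sketch; the names (build_graph_stage, read_to_edges) are illustrative, not the ported Contrail code. The loopback state remembers the (src, dst) k-mer edges seen so far, so re-delivered or overlapping reads add only the edges they newly introduce.

# Hedged sketch of the Build Graph stage as a stateful CBP stage; names are
# illustrative, not the ported Contrail code.
from collections import defaultdict

def read_to_edges(read, k=4):
    """Split a read into de Bruijn (src, dst) edges over (k-1)-mers."""
    return [(read[i:i + k - 1], read[i + 1:i + k]) for i in range(len(read) - k + 1)]

def build_graph_stage(state, delta_reads):
    """state: src -> set of dst seen so far (the loopback flow).
    delta_reads: newly arrived reads. Emits only edges not already in state."""
    delta_edges = defaultdict(set)
    for read in delta_reads:
        for src, dst in read_to_edges(read):
            if dst not in state[src]:
                state[src].add(dst)
                delta_edges[src].add(dst)
    return state, delta_edges

state = defaultdict(set)
state, d1 = build_graph_stage(state, ["ACGTA"])          # first batch builds state
state, d2 = build_graph_stage(state, ["ACGTA", "GTAC"])  # replayed reads add nothing new
print(dict(d2))    # only the edge introduced by "GTAC": {'GTA': {'TAC'}}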