Optimizing Large Genome Assembly in the Cloud
Apurva Kumar, Soumyarupa De and Kenneth Yocum
University of California, San Diego
Genomic Analytics

Data
• High-throughput sequencing of a human genome (African male NA18507) produces 3.5 billion 36-base-pair (bp) reads, about 756 MB of data (cost: roughly USD 1,000).

Current problem: assemble the genome at lower cost and in minimal time.

Why this problem?
• Only 60 complete human genomes are publicly available (including Steve Jobs's, which he had sequenced for USD 100K).
• Commercial sequencing is still at an early stage and growing rapidly.

Question: How do we scale to this 3.5 B bp of input data and compute the final genome sequence?
Optimizing using Stateful Bulk Processing
We use stateful Continuous Bulk Processing (CBP), an efficient stateful graph-processing model (expressive enough to support even Google's "Pregel"), and run it on top of Azure.
[Figure: Contrail implemented as CBP stages. Each stage is a translate function T(·, ΔF_state, ΔF_1) that consumes keyed deltas (Δin) from its input flows (F_in edges, F_in state) and emits deltas (Δout) on its output flows (F_out edges, F_out state). Caption: translate function for a stage with state as a loopback flow.]
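To make the translate-function picture concrete, here is a minimal Python sketch; it is not the actual CBP or Azure API, and the names (translate_edges, run_stage) are illustrative. Each per-key invocation receives the prior state records from the loopback flow plus the newly arrived input delta, and emits an updated state delta together with an output delta.

# Minimal sketch of a CBP-style stage with state as a loopback flow.
# All names are illustrative; this is not the actual CBP or Azure API.

def translate_edges(key, delta_state, delta_edges):
    """T(key, ΔF_state, ΔF_1): merge newly arrived edges for this key (a graph
    node) into the edge set remembered on the loopback state flow."""
    state = set(delta_state)              # prior state records for this key
    new_edges = set(delta_edges) - state  # only the genuinely new edges
    state |= new_edges
    # Emit (updated state for the loopback flow, new edges downstream).
    return sorted(state), sorted(new_edges)

def run_stage(translate, state_flow, input_flow):
    """Group both flows by key and invoke the translate function per key."""
    keys = set(state_flow) | set(input_flow)
    out_state, out_edges = {}, {}
    for k in keys:
        s, e = translate(k, state_flow.get(k, []), input_flow.get(k, []))
        out_state[k], out_edges[k] = s, e
    return out_state, out_edges

# Usage: the first run creates state; the second run only sees the deltas.
state = {}
state, out = run_stage(translate_edges, state, {"ACG": ["CGT"], "CGT": ["GTA"]})
state, out = run_stage(translate_edges, state, {"ACG": ["CGA"]})
print(state["ACG"], out["ACG"])   # ['CGA', 'CGT'] ['CGA']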
Our Approach
Aim: assemble a human genome (3.5 B reads of 36 bp).
Pipeline: genomic analysis → graph processing → CBP on Azure (+ Newt for debugging).
Genome assembly using graphs
• Build a de Bruijn graph from the sample of short input reads.
• An Eulerian walk across the graph gives the genomic sequence (see the toy sketch below).
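The following toy Python sketch shows the idea; it is deliberately simplified, and a real assembler such as Contrail must also handle sequencing errors, reverse complements, branching, and coverage.

# Toy de Bruijn assembly sketch: real assemblers also handle sequencing
# errors, reverse complements, branching, and coverage.
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer in a read adds a prefix->suffix edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Greedy Eulerian-style walk: consume one outgoing edge at a time."""
    seq, node = start, start
    while graph[node]:
        node = graph[node].pop()
        seq += node[-1]          # each step contributes one new base
    return seq

reads = ["ACGTA", "CGTAC", "GTACG"]   # overlapping 5-bp toy "reads"
g = de_bruijn(reads, k=4)
print(walk(g, "ACG"))                 # reconstructs a longer contig: ACGTACG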
Errors in Contrail Pipeline
• Contrail is a multi-stage pipeline with several MapReduce stages.
• Types of errors: fail-stop crashes, corrupted outputs, and suspicious actors.
Debugging via Newt
• Newt's "fail" API is triggered on a crash.
  - It reports the crash culprits to Newt.
• Backward tracing: find the inputs that lead to corrupted outputs or fail-stops (sketched after this list).
  - Prune the selected inputs and replay the entire pipeline.
• Crash avoidance: remove crash culprits immediately and continue.
  - No replay overhead; transparent fault handling.
  - Open question: how to handle removed inputs that are used on other dataflow paths?
• Online identification of suspicious actor behavior based on the actor's history.
  - Processing rate (too slow, too fast), selectivity (n-to-1 instead of 1-to-1).
  - Open question: how much history to keep?
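The backward-tracing step can be viewed as a search over the captured lineage records, from a bad output back to the raw inputs that produced it. The following Python sketch is illustrative only; the lineage layout and names are assumptions, not Newt's actual API or schema.

# Illustrative sketch of backward tracing over per-stage lineage records.
# This is not Newt's actual API or storage schema.

# lineage[stage] maps an output record id -> the input record ids it came from.
lineage = {
    "build_graph":  {"edge7": ["read3", "read9"], "edge8": ["read4"]},
    "refine_graph": {"contig2": ["edge7", "edge8"]},
}
stage_order = ["build_graph", "refine_graph"]   # upstream -> downstream

def backward_trace(bad_outputs):
    """Walk the lineage from the last stage back to the pipeline's raw inputs."""
    frontier = set(bad_outputs)
    for stage in reversed(stage_order):
        mapping = lineage[stage]
        # Records with no recorded lineage pass through unchanged.
        frontier = {src for out in frontier for src in mapping.get(out, [out])}
    return frontier            # culprit raw inputs

culprits = backward_trace({"contig2"})
print(culprits)                # {'read3', 'read9', 'read4'} (in some order)

def replay(all_inputs, culprits):
    """Prune the culprit inputs and re-run the pipeline on what remains."""
    return [r for r in all_inputs if r not in culprits]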
Contrail with Newt
[Figure: Contrail pipeline: short-read files → Build Graph → Graph refinement → Genome sequence. The legend marks stages whose porting to CBP is completed versus pending.]

Current Assembly Techniques
Assembler     Technology   CPU/RAM
Velvet        Serial       2 TB RAM
ABySS         MPI          168 cores x 96 hours
SOAPdenovo    Pthreads     40 cores x 40 hours, > 140 GB RAM (total)
Bulk Processing with State
[Figure: An incremental CBP dataflow (web-crawl example). Input data D flows through Extract links → Count in-links (site/URL frequency) → Merge w/ seen → Score and threshold to produce output A; the stateful stages keep state, so a later change ΔD flows through the same stages against that state and yields only the output change ΔA.]
1.) Process dataflow σ with input D.
2.) Create state.
3.) Process the changes, ΔD, together with the prior state.
CBP updates rather than reprocessing, which saves CPU, disk, network, and energy.
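As a concrete analogue of steps 1.)-3.), here is a small Python sketch of the in-link counting stage from the figure; the function and variable names are hypothetical, not the CBP runtime. The first call processes the full input D and creates state; a later call processes only the change ΔD against that state instead of recomputing from scratch.

# Sketch of the incremental "count in-links" stage: hypothetical names,
# not the actual CBP runtime.
from collections import Counter

def count_inlinks(state, delta_links):
    """state: Counter of URL -> in-link count; delta_links: new (src, dst) pairs.
    Returns (updated state, the delta of counts that changed)."""
    delta_out = Counter(dst for _, dst in delta_links)
    state.update(delta_out)                    # merge with what we've seen
    return state, delta_out

# 1.) Process the full input D and 2.) create state.
state = Counter()
D = [("a.com", "x.com"), ("b.com", "x.com"), ("c.com", "y.com")]
state, _ = count_inlinks(state, D)

# 3.) Later, process only the change ΔD against the prior state.
delta_D = [("d.com", "x.com")]
state, delta_A = count_inlinks(state, delta_D)
print(state["x.com"], dict(delta_A))   # 3 {'x.com': 1}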
Newt Lineage
• Newt is a provenance manager.
• It captures fine-grained provenance in MapReduce jobs.
• It traces data provenance through the multi-stage pipeline.
• It replays actors with selected inputs.
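One way to picture fine-grained provenance capture is a wrapper around a map-style actor that records which input record produced which output records. The sketch below is illustrative only and does not reflect Newt's real instrumentation or record format.

# Illustrative sketch of fine-grained provenance capture around a map-style
# actor; not Newt's real instrumentation or record format.

provenance = []   # (stage, input_id, output_id) triples

def traced_map(stage, map_fn, records):
    """Apply map_fn to each (id, value) record and log input->output lineage."""
    outputs = []
    for in_id, value in records:
        for j, out_value in enumerate(map_fn(value)):
            out_id = f"{in_id}/{j}"
            provenance.append((stage, in_id, out_id))
            outputs.append((out_id, out_value))
    return outputs

def split_into_kmers(read, k=4):
    return [read[i:i + k] for i in range(len(read) - k + 1)]

out = traced_map("build_graph", split_into_kmers, [("read1", "ACGTA")])
print(out)          # [('read1/0', 'ACGT'), ('read1/1', 'CGTA')]
print(provenance)   # [('build_graph', 'read1', 'read1/0'), ('build_graph', 'read1', 'read1/1')]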
Contrail implemented as CBP Stages
• Contrail (Schatz et al. 2009) uses Hadoop: it builds a large graph (>3 B nodes and >10 B edges) and iterates. It scales, but it is inefficient, slow (a 1.5 MB input takes 2 hours to assemble), and complex (40+ MR stages).
• The Build Graph stage is stateful (it saves src, dst information); the 10 Contrail MR stages in this phase have been mapped to CBP.
• The Graph refinement stage is stateful and iterative (it refines the graph); the 30 Contrail MR stages in this phase remain to be mapped.
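To illustrate what a stateful Build Graph stage could look like in CBP terms, here is a hedged Python sketch; the names (build_graph_stage, read_to_edges) are illustrative, not the ported Contrail code. The loopback state remembers the (src, dst) k-mer edges seen so far, so re-delivered or overlapping reads add only the edges they newly introduce.

# Hedged sketch of the Build Graph stage as a stateful CBP stage; names are
# illustrative, not the ported Contrail code.
from collections import defaultdict

def read_to_edges(read, k=4):
    """Split a read into de Bruijn (src, dst) edges over (k-1)-mers."""
    return [(read[i:i + k - 1], read[i + 1:i + k]) for i in range(len(read) - k + 1)]

def build_graph_stage(state, delta_reads):
    """state: src -> set of dst seen so far (the loopback flow).
    delta_reads: newly arrived reads. Emits only edges not already in state."""
    delta_edges = defaultdict(set)
    for read in delta_reads:
        for src, dst in read_to_edges(read):
            if dst not in state[src]:
                state[src].add(dst)
                delta_edges[src].add(dst)
    return state, delta_edges

state = defaultdict(set)
state, d1 = build_graph_stage(state, ["ACGTA"])          # first batch builds state
state, d2 = build_graph_stage(state, ["ACGTA", "GTAC"])  # replayed reads add nothing new
print(dict(d2))    # only the edge introduced by "GTAC": {'GTA': {'TAC'}}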