Cloud Computing
Technologies for Genomic
Big Data Analysis
Fabrício A. B. Silva, Alberto Davila
FIOCRUZ
{fabs,davila}@fiocruz.br
Big Data – A Definition
“Big data is a term used to describe
information assemblages that make
conventional data, or database, processing
problematic due to any combination of
their size (volume), frequency of update
(velocity), or diversity (variety)”
Hay SI, George DB, Moyes CL, Brownstein JS (2013) Big Data
Opportunities for Global Infectious Disease Surveillance. PLoS Med
10(4): e1001413. doi:10.1371/journal.pmed.1001413
The Data Deluge
“In the last five years, more scientific data
has been generated than in the entire
history of mankind. You can imagine
what’s going to happen in the next five.”
Winston Hide, associate professor of bioinformatics
Harvard School of Public Health.
The promise of big data. HSPH News, Spring/Summer 2012
Example: GenBank

http://www.ncbi.nlm.nih.gov/genbank/statistics
Accessed on Oct 22, 2013
DNA Sequencing Evolution

Stein, L. D. (2010). The case for cloud computing in genome
informatics. Genome Biol, 11(5), 207.
Interesting Facts...
• The cost of sequencing a human genome
decreased from US$1 million in
2007 to US$1 thousand in 2012
• A human genome has 3 billion bp ~ 100
GB of raw data
• NCI’s million genomes project: 1 million
TB, or 1,000 petabytes, or 1 exabyte
O'Driscoll, A., Daugelaite, J., & Sleator, R. D. (2013). ‘Big data’, Hadoop
and cloud computing in genomics. Journal of Biomedical Informatics.
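The unit conversions on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units, which the slide does not state. The per-genome figure implied by the project total is derived here and is not a number given in the source.

```python
# Decimal (SI) storage units, in bytes. Assumption: the slide uses SI units.
GB = 10**9
TB = 10**12
PB = 10**15
EB = 10**18

N_GENOMES = 1_000_000

# Figure quoted on the slide: 1 million TB for the whole project.
project_bytes = 1_000_000 * TB
project_pb = project_bytes / PB   # project size in petabytes
project_eb = project_bytes / EB   # project size in exabytes

# Derived, not stated on the slide: storage per genome implied by that total.
implied_tb_per_genome = project_bytes / N_GENOMES / TB

# For comparison: raw data alone at ~100 GB per genome (slide figure).
raw_only_pb = N_GENOMES * 100 * GB / PB

print(f"Project total: {project_pb:.0f} PB = {project_eb:.0f} EB")
print(f"Implied storage per genome: {implied_tb_per_genome:.0f} TB")
print(f"Raw reads only, 1 million genomes: {raw_only_pb:.0f} PB")
```

Under these assumptions the project total works out to about 1 TB per genome, roughly ten times the raw-data figure quoted for a single genome.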
The Processing Bottleneck
Software   Number of Cores   Start           Finish          Processing Time   File sizes
Flash      24                9/12/13 22:48   9/12/13 22:48   0:00:53           2 files: 237 Mb and 238 Mb
Velveth    1                 9/12/13 22:50   9/12/13 22:52   0:01:39           3 files: 100 Mb, 166 Mb and 165 Mb
Velvetg    1                 9/12/13 22:54   9/12/13 22:59   0:04:53           2 files: 250 Mb and 75 Mb
Mira       24                9/12/13 23:11   9/12/13 23:32   0:21:21           2 files: 69 Mb and 6 Mb
Glimmer3   1                 9/12/13 23:40   9/12/13 23:40   0:00:40           2 files: 6 Mb and 1.4 Mb
Blastx     24                9/12/13 23:46   9/13/13 9:23    9:36:15           Against RefSeq (17,411,217 entries)

Pipeline processed @ Computational and Systems Biology Lab, Bioinformatics Platform, Instituto
Oswaldo Cruz, FIOCRUZ – Input Data size: 500MB
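To see where the time goes, the processing times in the table above can be summed and compared. This is a small sketch that uses only the durations listed on the slide; the function and variable names are illustrative.

```python
from datetime import timedelta

# Processing times copied from the table (h:mm:ss).
STEP_TIMES = {
    "Flash": "0:00:53",
    "Velveth": "0:01:39",
    "Velvetg": "0:04:53",
    "Mira": "0:21:21",
    "Glimmer3": "0:00:40",
    "Blastx": "9:36:15",
}

def parse_hms(text):
    """Convert an 'h:mm:ss' string into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

durations = {name: parse_hms(text) for name, text in STEP_TIMES.items()}
total = sum(durations.values(), timedelta())

# Dividing one timedelta by another yields a plain float ratio.
blastx_share = durations["Blastx"] / total

print(f"Total processing time: {total}")
print(f"Blastx share of total: {blastx_share:.1%}")
```

The Blastx step accounts for roughly 95% of the summed processing time, which is why the similarity search is the natural candidate for scaling out.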
NGS: Expect Much More Data
What Then?
Cloud Computing: a
Definition
• “Cloud computing is a model for
enabling convenient, on-demand network
access to a shared pool of configurable
computing resources (e.g., networks,
servers, storage, applications, and
services) that can be rapidly provisioned
and released with minimal management
effort or service provider interaction”
NIST – Available at http://www.nist.gov/itl/cloud/upload/cloud-def-v15.pdf
Cloud Computing:
Advantages
• Flexibility
– Use of virtualization technology

• Scalability
– Large number of nodes connected at
local-network speed

• Availability/Accessibility
– Even small labs can harness the power of the
Cloud
Cloud Scalability: Example

Schadt, E. E., Linderman, M. D., Sorenson, J., Lee, L., & Nolan, G. P. (2011). Cloud
and heterogeneous computing solutions exist today for the emerging big data problems
in biology. Nature Reviews Genetics, 12(3), 224-224.
Cloud Computing:
Challenges
• Bandwidth Limits
– Large data sets need to be moved to the
cloud

• Security/Privacy Issues
– Limited control over remote storage

• Expertise
– Adapting new applications to the cloud still
requires some technical expertise
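The bandwidth limit can be made concrete with a rough transfer-time estimate. The link speeds below are hypothetical examples, not figures from the slides, and the calculation ignores protocol overhead and congestion, so real transfers would take longer.

```python
GB = 10**9  # decimal gigabyte, in bytes

def transfer_seconds(size_bytes, link_mbit_per_s):
    """Ideal time to move size_bytes over a link of the given speed."""
    bits = size_bytes * 8
    return bits / (link_mbit_per_s * 10**6)

def transfer_hours(size_bytes, link_mbit_per_s):
    return transfer_seconds(size_bytes, link_mbit_per_s) / 3600

# ~100 GB of raw data for one human genome (figure from the earlier slide).
genome_bytes = 100 * GB

# Hypothetical link speeds, chosen only for illustration.
for speed in (100, 1000):
    hours = transfer_hours(genome_bytes, speed)
    print(f"{speed} Mbit/s: {hours:.2f} h per genome")
```

Even under these ideal assumptions, uploading a single genome's raw data over a 100 Mbit/s link takes more than two hours.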
MapReduce
• MapReduce/Hadoop
– MapReduce: Parallel distributed framework
invented by Google for processing large data sets
– Data and computations are spread over thousands
of computers, processing petabytes of data each
day
– Hadoop is the leading open-source implementation
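The programming model can be illustrated without a cluster. The following is a single-process sketch of the map, shuffle and reduce phases, counting k-mers in sequencing reads. It does not use Hadoop; the function names are illustrative, and a real framework would distribute each phase across many machines.

```python
from collections import defaultdict

def map_phase(read, k):
    """Map: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def count_kmers(reads, k):
    pairs = (pair for read in reads for pair in map_phase(read, k))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(count_kmers(["ACGT", "CGTA"], 2))
```

Because each read is mapped independently and each key is reduced independently, both phases can run in parallel, which is the property the framework exploits.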
MapReduce
• MapReduce/Hadoop: Advantages
– Scalable, Efficient, Reliable
– Easy to program
– Runs on commodity computers

• MapReduce/Hadoop: Challenges
– Redesigning, retooling applications
Cloud Computing in
Genomics
• Crossbow
– Scalable software pipeline for whole genome
resequencing analysis over Hadoop

• CloudBurst
– Highly sensitive short read mapping over Hadoop

• Myrna
– Tool for calculating differential gene expression in large
RNA-seq datasets over Hadoop
Cloud Computing in
Genomics
• Contrail
– De novo assembly of large genomes over Hadoop

• CloudBlast
– Scalable BLAST over Hadoop

• Quake
– DNA sequence error detection and correction in sequence
reads over Hadoop
Cloud Computing in
Genomics
• More examples of Hadoop based apps:
– CloudAligner
– BlastReduce
– CloudBrush
– GATK
– Nephele
– BlueSNP
– Etc…
Crossbow: Hadoop
Streaming

Langmead, B., Schatz, M. C., Lin, J., Pop, M., & Salzberg, S. L. (2009). Searching for
SNPs with cloud computing. Genome Biol, 10(11), R134.
Crossbow: Hadoop
Streaming
1. Map (Bowtie): many sequencing reads are
mapped to the reference genome in parallel.
2. Shuffle: the sequence alignments are
aggregated so that all alignments on the same
chromosome or locus are grouped together
and sorted by position.
3. Reduce/Scan (SOAPsnp): the sorted
alignments are scanned to identify SNPs
(Single Nucleotide Polymorphisms) within each
region.
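The three steps above can be sketched in a few lines. This is a toy illustration of the data flow only: the mapper is a naive ungapped best-match search standing in for Bowtie, and the reducer is a simple majority vote standing in for SOAPsnp. The reference sequence, read data and function names are all hypothetical.

```python
from collections import Counter
from itertools import groupby

def map_read(read, reference):
    """Map: place the read at the ungapped position with the fewest
    mismatches and emit one ((chrom, pos), base) record per base."""
    best = None
    for chrom, seq in reference.items():
        for start in range(len(seq) - len(read) + 1):
            window = seq[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if best is None or mismatches < best[0]:
                best = (mismatches, chrom, start)
    if best is None:
        return
    _, chrom, start = best
    for offset, base in enumerate(read):
        yield (chrom, start + offset), base

def shuffle(records):
    """Shuffle: sort records so each locus is contiguous and ordered."""
    return sorted(records, key=lambda record: record[0])

def reduce_scan(sorted_records, reference):
    """Reduce/Scan: report loci whose majority base differs from the reference."""
    snps = []
    for (chrom, pos), group in groupby(sorted_records, key=lambda r: r[0]):
        counts = Counter(base for _, base in group)
        consensus, _ = counts.most_common(1)[0]
        ref_base = reference[chrom][pos]
        if consensus != ref_base:
            snps.append((chrom, pos, ref_base, consensus))
    return snps

def call_snps(reads, reference):
    records = [rec for read in reads for rec in map_read(read, reference)]
    return reduce_scan(shuffle(records), reference)

if __name__ == "__main__":
    toy_reference = {"chr1": "ACGTTGCA"}   # hypothetical reference
    toy_reads = ["ACGA", "CGAT", "GATG"]   # all carry A instead of T at position 3
    print(call_snps(toy_reads, toy_reference))
```

With Hadoop Streaming, the map and reduce steps would be separate executables that read and write tab-separated records on standard input and output, and the framework would perform the sort between them.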
Cloud-enabled
Technologies
• Apache HBase
– Open source, non-relational,
distributed database modeled after
Google's BigTable. It runs on top of
HDFS (Hadoop Distributed
Filesystem), providing BigTable-like
capabilities for Hadoop
Cloud-enabled
Technologies
• Apache Cassandra
– Linearly scalable, highly available
database that can run on commodity
hardware or cloud infrastructure,
with support for replication across
multiple datacenters.

• Google's Pregel/Apache Giraph
– Iterative graph processing system
built for high scalability
Cloud-enabled
Technologies
• Apache Hive
– data warehouse system for Hadoop
that facilitates easy data
summarization, ad-hoc queries, and
the analysis of large datasets

• Apache Pig
– high-level language for expressing
data analysis programs, coupled with
evaluation infrastructure over
Hadoop
Parallel Patterns for the
Cloud
• Stream-oriented
– Farm
– Farm with feedback
– Pipeline

• Data-parallel
– Map
– Reduce
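Some of the patterns listed above can be sketched with the Python standard library. This is a minimal illustration, assuming a thread pool as the set of workers; the helper names (farm, pipeline, gc_count) are hypothetical, and the farm-with-feedback pattern is not covered.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add

def gc_count(read):
    """Count G and C bases in one read."""
    return sum(base in "GC" for base in read)

def farm(worker, tasks, n_workers=4):
    """Farm: independent tasks are handed to a pool of identical workers.
    Results come back in task order."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, tasks))

def pipeline(items, *stages):
    """Pipeline: every item flows through the stages in order."""
    for stage in stages:
        items = map(stage, items)
    return list(items)

if __name__ == "__main__":
    reads = ["ggcc", "atat", "gcat"]

    # Pipeline: normalise case, then count GC bases per read.
    per_read = pipeline(reads, str.upper, gc_count)

    # Map (run here as a farm) followed by Reduce: total GC bases.
    upper_reads = [read.upper() for read in reads]
    total_gc = reduce(add, farm(gc_count, upper_reads), 0)

    print(per_read, total_gc)
```

In this sketch the farm and the data-parallel map look alike because the tasks are known up front; a stream-oriented farm would instead consume tasks as they arrive.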
Pipeline Pattern: Stingray@Galaxy
Multiple Parallel Patterns

Aldinucci, Marco, et al. Parallel stochastic systems biology in the cloud. Briefings in
Bioinformatics (2013).
But... our group does not have the expertise to develop
our own cloud applications...
Can we still use the cloud/MapReduce for genomic
processing?
Galaxy CloudMan
Cloudgene

Schönherr, S. et al. (2012). Cloudgene: A graphical execution platform for
MapReduce programs on private and public clouds. BMC bioinformatics, 13(1),
200.
What's Next?
• Beyond Hadoop
– Adoption of new technologies/parallel
patterns for genomic data analysis in the
cloud

• Scalable Data Storage
– High Availability/Support for replication
– Preliminary work on HBase by Intel

• Private/Hybrid/Corporate Clouds
– Privacy/security issues
– Data tenancy
Thank You!!!
Acknowledgements: Nelson Kotowski, Rodrigo Jardim (FIOCRUZ)


Editor's Notes

  • Slide 4 (GenBank): The number of bases in GenBank doubles every 18 months.
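The 18-month doubling time in this note translates into a simple exponential growth factor, 2^(t/18) for t in months. The sketch below assumes the doubling rate stays constant, which the note does not claim for the future.

```python
DOUBLING_MONTHS = 18  # doubling time quoted in the editor's note

def growth_factor(months, doubling_months=DOUBLING_MONTHS):
    """Factor by which the number of bases grows after the given months,
    assuming a constant doubling time."""
    return 2 ** (months / doubling_months)

if __name__ == "__main__":
    for years in (1.5, 3, 5):
        factor = growth_factor(years * 12)
        print(f"After {years} years: x{factor:.1f}")
```

Under that assumption, the database grows about tenfold in five years.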