White Paper: Hadoop in Life Sciences — An Introduction

White Paper

HADOOP IN THE LIFE SCIENCES:
An Introduction

Abstract
This introductory white paper reviews the Apache Hadoop™
technology, its components – MapReduce and Hadoop
Distributed File System (HDFS) – and its adoption in the Life
Sciences with an example in Genomics data analysis.

March 2012

Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as
of its publication date. The information is subject to change
without notice.

The information in this publication is provided “as is.” EMC
Corporation makes no representations or warranties of any kind
with respect to the information in this publication, and
specifically disclaims implied warranties of merchantability or
fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in
this publication requires an applicable software license.

Part number h10574


Table of Contents
Audience
Executive Summary
Hadoop: an Introduction
Genomics example: CrossBow
Enterprise-Class Hadoop on EMC Isilon
Conclusion
References

Audience
This white paper introduces the new data processing and analysis paradigm,
Hadoop™, within the context of its usage in the life sciences, specifically Genomics
Sequencing. It is intended for audiences with basic knowledge of storage and
computing technology and a rudimentary understanding of DNA sequencing and the
bioinformatics analysis associated with it.


Executive Summary
Life Sciences data will reach the ExaByte (10^18 bytes, EB) scale soon. This is “Big
Data”. As a reference point, all words ever spoken by all human beings when
transcribed are about 5 EB of data. In a recent article titled “Will Computers Crash
Genomics?” [1], the analysis points to exponential growth of the total genomics
sequencing market capacity, as outlined in Figure 1 below: 10 Tera base-pairs
(1 Tbp = 10^12 bp) per day, with an astounding 5x year-on-year growth rate (500%).
The human genome is approximately 3 billion base pairs long. A base pair (bp)
consists of two complementary DNA nucleotides, paired as G-C or A-T.

Figure 1: Genomics Growth

Each base-pair represents a total of about 100 bytes (of raw, analyzed and
interpreted data). Therefore the genomics market capacity in 2010 storage terms
(from Fig. 1) was about 200 PetaBytes (PB), with the capacity growing to about 1
ExaByte (EB) by late 2012. This capacity is drowning out technologies attempting to
handle the deluge of Big Data in the life sciences. Proteomics (study of proteins) and
imaging data are in the early stages of this exponential rise. It is not just the data
storage volume, but also its velocity and variability that make this a challenge
requiring “scale-out” technologies, which grow simply and painlessly as the data
center and business needs grow. Within the past year, one computing and storage
framework has matured into a contender to handle this tsunami of Big Data: Hadoop™.
Life Sciences workflows require a High Performance Computing (HPC) infrastructure to
process and analyze the data to determine the variations in the genome and the
proper scale of storage to retain this data. With Next Generation (genome)
Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per
week per sequencer – not including the raw images – the need for scale-out storage
that integrates easily with HPC is a “line item requirement”. EMC Isilon has provided
the scale-out storage for nearly all the workflows for all the DNA sequencer instrument
manufacturers in the market today, at more than 150 customers. Since 2008, the EMC
Isilon OneFS storage platform has built a Life Sciences installed base of more than 65
PetaBytes (PB).


As genomics has very large, semi-structured, file-based data and is modeled on
post-process streaming data access and I/O patterns that can be parallelized, it is
ideally suited for Hadoop. Hadoop consists of two main components, a file system and
a compute system: the Hadoop Distributed File System (HDFS) and the MapReduce
framework respectively. The Hadoop ecosystem consists of many open source tools, as
shown in Figure 2 below:

Figure 2: Hadoop Components

To make the Hadoop storage “scale-out” and truly distributed, the EMC Isilon
OneFS™ file system features connectivity to the Hadoop Distributed File System
(HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [3]. This
allows the data to be co-located with its compute nodes, while MapReduce “jobs” are
still built with the standard higher-level Java application programming interface (API).

Hadoop: an Introduction
Hadoop was created by Doug Cutting of the Apache Lucene project [4], initially as the
Nutch Distributed File System (NDFS), which was inspired in 2004 by Google’s
distributed file system infrastructure (the Google File System, GFS) and its
MapReduce [5] application layer. Hadoop is an Apache™ Foundation project that
comprises a MapReduce layer for data analysis and a Hadoop Distributed File System
(HDFS) layer, written in the Java programming language, to distribute and scale the
MapReduce data.
The Hadoop MapReduce framework runs on the compute cluster using the data
stored on the HDFS. MapReduce 'jobs' aim to provide a key/value based processing
ability in a highly parallelized fashion. Since the data is distributed over the cluster, a
MapReduce job can be split up to run many parallel processes over the data stored
on the cluster. The Map parts of MapReduce only run on the data they can see, that
is, the data blocks on the particular machine each Map is running on. The Reduce
brings together the output from the Maps. The result is a system that provides a
highly parallel batch processing capability. The system scales well, since you just
need to add more hardware to increase its storage capability or decrease the time a
MapReduce job takes to run.
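
The Map and Reduce roles described above can be imitated inside a single Python process. The sketch below is only an illustration under stated assumptions: the node names, the block contents and the functions `map_block`, `reduce_key` and `run_job` are hypothetical, and nothing here calls the real Hadoop API. Each "node" maps only the blocks it holds, and the reduce sums the per-key values emitted by all maps.

```python
from collections import Counter, defaultdict

# Hypothetical cluster: each worker node holds only its own data blocks.
BLOCKS_BY_NODE = {
    "node-1": ["ACGT", "AACC"],
    "node-2": ["GGTA"],
    "node-3": ["TTTT"],
}

def map_block(block):
    """Map: emit (base, count) pairs for one locally stored block."""
    return list(Counter(block).items())

def reduce_key(key, values):
    """Reduce: combine every value the maps emitted for one key."""
    return key, sum(values)

def run_job(blocks_by_node):
    # Map phase: every node processes only the blocks it stores itself.
    map_outputs = []
    for node, blocks in blocks_by_node.items():
        for block in blocks:
            map_outputs.extend(map_block(block))

    # Shuffle phase: group the intermediate values by key.
    grouped = defaultdict(list)
    for key, value in map_outputs:
        grouped[key].append(value)

    # Reduce phase: one reduce call per key brings the map outputs together.
    return dict(reduce_key(k, v) for k, v in sorted(grouped.items()))

if __name__ == "__main__":
    print(run_job(BLOCKS_BY_NODE))  # {'A': 4, 'C': 3, 'G': 3, 'T': 6}
```

In a real cluster the map calls would run concurrently on separate machines; the loop over nodes here only stands in for that parallelism.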
The partitioning of the storage and compute framework into master and worker node
types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. Because
data is written sequentially and never updated in place, Hadoop can be faster than
HPC and storage that are integrated separately. The life sciences have been at the
forefront of the technology adoption curve: one of the earliest use-cases of the Sun
GridEngine [6] HPC was the DNA sequence comparison BLAST [16] search.
Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV [7]. The R
(statistical language) Hadoop interface, RHIPE [8], is also popular in the life sciences
community.
The HDFS layer has a controller, the “Name Node”, which tracks where each data
block is stored and so provides “data locality”. HDFS uses a “shared nothing”
architecture, which is a distributed scheme built from independent nodes [7].
From a platform perspective, the OneFS HDFS interface is compatible with Apache
Hadoop, EMC GreenPlum [3] and Cloudera. In a traditional Hadoop implementation, the
HDFS “Name Node” is a single point of failure since it is the sole keeper of all the
metadata for all the data that lives in the filesystem. The OneFS HDFS interface
resolves this by distributing the name node data [3]. HDFS keeps three replicas (3x)
of the data for redundancy. OneFS drastically reduces the need for a 3x copy.
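
The cost of a 3x replica is simple arithmetic, sketched below. The 2 TB run size comes from the Executive Summary; the alternative overhead factor of 1.25 is a purely hypothetical value chosen for comparison and is not a published OneFS figure. The function name `raw_capacity_tb` is likewise made up for this example.

```python
def raw_capacity_tb(usable_tb, overhead_factor):
    """Raw disk needed to hold `usable_tb` of data at a given protection overhead."""
    return usable_tb * overhead_factor

HDFS_REPLICATION = 3.0   # three full copies of the data
ASSUMED_OVERHEAD = 1.25  # hypothetical lower overhead, NOT a OneFS specification

run_tb = 2.0  # one sequencing run, per the Executive Summary

print(raw_capacity_tb(run_tb, HDFS_REPLICATION))  # 6.0
print(raw_capacity_tb(run_tb, ASSUMED_OVERHEAD))  # 2.5
```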
A good example of the MapReduce algorithm “key-value” pair process for analyzing
word count of specific words across documents [9] is shown in Figure 4 below:


Figure 4: Hadoop Example – word count across documents
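
The word count in Figure 4 can be approximated in a few lines of Python. This is a minimal sketch, assuming made-up input documents; the `mapper`, `reducer` and `word_count` names are hypothetical, and the in-memory `sorted` call merely stands in for the sort that a real Hadoop job performs between the map and reduce phases.

```python
from itertools import groupby

def mapper(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1

def reducer(sorted_pairs):
    """Reduce: sum the counts of each word; input must be sorted by word."""
    for word, group in groupby(sorted_pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

def word_count(documents):
    # The sort stands in for the shuffle that brings identical keys together.
    return dict(reducer(sorted(mapper(documents))))

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the fox"]  # made-up input
    print(word_count(docs))
```

The reducer relies on its input being sorted by key, which is why `groupby` is sufficient and no dictionary of running totals is needed.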

Hadoop is not suited for low-latency, “in process” use-cases like real-time, spectral or
video analysis; or for large numbers of small files (<8KB). When small files have to be
used, the Hadoop Archive (HAR) can be used to archive small files for processing.
Life sciences organizations have been among Hadoop’s earliest adopters. Following
the publication of the first Apache Hadoop project [10] in January 2008, the first
large-scale MapReduce project was initiated by the Broad Institute, resulting in the
comprehensive Genome Analysis Tool Kit (GATK) [11]. The Hadoop “CrossBow”
project [12] from Johns Hopkins University came soon after. Other projects are
Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST [13]. An
interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop
cluster within the Magellan Science Cloud [14].


Genomics example: CrossBow
The Hadoop ‘word count across documents’ example in Fig. 4 can be extended to DNA
Sequencing: count single-base changes across millions of short DNA fragments and
across hundreds of samples.
A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G)
varies in the DNA sequence of members of the same biological species. Next
Generation Sequencers (NGS) like Illumina® HiSeq can produce data on the order of
200 Giga base pairs in a single one-week run for a 60x human genome “coverage”,
which means that each base was present in an average of 60 reads. The larger the
coverage, the more statistically significant the result. This data requires specialized
software algorithms called “short read aligners”.
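
As a quick check on the coverage figure, mean coverage is the total number of bases sequenced divided by the genome length. The sketch below uses the 200 Giga base pair run size quoted above and the 3 billion base pair genome length from the Executive Summary; the function name `mean_coverage` is made up for this example. The result is of the same magnitude as the 60x quoted, and the source does not say how the difference arises.

```python
def mean_coverage(total_bases_sequenced, genome_length):
    """Average number of reads covering each base of the genome."""
    return total_bases_sequenced / genome_length

run_output_bp = 200e9   # about 200 Giga base pairs per run (from the text)
human_genome_bp = 3e9   # about 3 billion base pairs (from the Executive Summary)

print(round(mean_coverage(run_output_bp, human_genome_bp), 1))  # 66.7
```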
CrossBow [12] is a combination of several algorithms that provide SNP calling and
short read alignment, which are common tasks in NGS. Figure 5 explains the steps
necessary to process genome data to look for SNPs.

Figure 5: Crossbow example – SNP calls across DNA fragments

The Map-Sort-Reduce process is ideally suited for a Hadoop framework. The cluster
as shown in Figure 5 is a traditional N-node Hadoop cluster.
1. The Map step is the short read alignment algorithm, called Bowtie (based on the
Burrows Wheeler Transform, BWT). Multiple instances of Bowtie are run in parallel in
Hadoop. The input tuples (an ordered list of elements) are the sequence reads and the
output tuples are the alignments of the short reads.
2. The Sort step apportions the alignments according to a primary key (the genome
partition) and sorts based on a secondary key (which is the offset for that partition).
The data here are the sorted alignments.
3. The Reduce step calls SNPs for each reference genome partition. Many parallel
instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP)
run in the Hadoop cluster. Input tuples are sorted alignments for a partition and the
output tuples are SNP calls.
Results are stored via HDFS and then archived in SOAPsnp format.
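
The three steps above can be mimicked with toy stand-ins. This is a sketch only: the reference sequence, reads, partition size and all function names are made up, the `align` function is a brute-force Hamming-distance search rather than Bowtie's BWT index, and `reduce_step` is a simple majority vote rather than the SOAPsnp statistical model. For brevity the reduce runs over all partitions in one call instead of once per partition.

```python
from collections import Counter, defaultdict

REFERENCE = "ACGTACGTTAGC"   # made-up reference genome
PARTITION_SIZE = 4           # made-up partition width

def align(read, reference=REFERENCE, max_mismatches=1):
    """Toy aligner (stand-in for Bowtie): best offset by Hamming distance."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[0]):
            best = (mismatches, offset)
    return None if best is None else best[1]

def map_step(reads):
    """Map: reads in, alignments out, keyed by (partition, offset)."""
    for read in reads:
        offset = align(read)
        if offset is not None:
            yield (offset // PARTITION_SIZE, offset), read

def sort_step(alignments):
    """Sort: primary key is the partition, secondary key is the offset."""
    return sorted(alignments, key=lambda item: item[0])

def reduce_step(sorted_alignments, min_support=2):
    """Reduce (stand-in for SOAPsnp): call a SNP where the most common read
    base differs from the reference and is seen in at least min_support reads."""
    pileup = defaultdict(Counter)
    for (_, offset), read in sorted_alignments:
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1
    calls = []
    for position in sorted(pileup):
        base, support = pileup[position].most_common(1)[0]
        if base != REFERENCE[position] and support >= min_support:
            calls.append((position, REFERENCE[position], base))
    return calls

if __name__ == "__main__":
    # Made-up reads from a sample that carries C instead of G at position 6.
    reads = ["CCTTAG", "ACGTAC", "ACCTTA", "GTACCT"]
    print(reduce_step(sort_step(map_step(reads))))  # [(6, 'G', 'C')]
```

Each call is reported as (position, reference base, observed base). Keying the alignments by (partition, offset) is what lets an ordinary tuple sort reproduce the primary and secondary key ordering of step 2.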

Enterprise-Class Hadoop on EMC Isilon
As demonstrated by previous examples, the data and analysis scalability required for
Genomics is ideally suited for Hadoop. EMC Isilon’s OneFS distributes the Hadoop
Name Node to provide high availability and load balancing, thereby eliminating the
single point of failure. The Isilon NAS storage solution provides a highly efficient
single file system/single volume, scalable up to 15 PB. Data can be staged from other
protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise
Grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ for
advanced backup and disaster recovery capabilities.
The equation for Hadoop scalability can be represented as:
Big(Data + Analytics) = Hadoop EMC:Isilon
These advantages are summarized in Fig. 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution [17], the
Hadoop architecture becomes a complete Enterprise package.


Conclusion
What began as an internal project at Google in 2004 has now matured into a scalable
framework for two computing paradigms that are particularly suited for the life
sciences: parallelization and distribution. The post-processing streaming data
patterns for text strings, clustering and sorting – the core process patterns in the life
sciences – are ideal workflows for Hadoop. The CrossBow example discussed above
aligned Illumina NGS reads for SNP calling over a ‘35x’ coverage of the human
genome in under 3 hours using a 40-node Hadoop cluster; an order of magnitude
better than traditional HPC technology for parallel processes.
Even though Hadoop implementations on Public Cloud instances are popular, several
issues have resulted in most large institutions maintaining their own data
repositories internally: large data transfer from the on-premise storage to
the Cloud; data regulations and security; data availability; data redundancy and HPC
throughput. This is especially true as genome sequencing moves into the Clinic for
diagnostic testing.
The convergence of these issues is evidenced by the mirroring of the Short Read
sequence Archive (SRA) at the National Center for Biotechnology Information (NCBI)
on the DNAnexus SRA Cloud [15]. The DNAnexus business model is slowly evolving
into a ‘full data and analysis offsite’ model via Hadoop. The Hybrid Cloud model (a
source data mirror between Private Cloud and Community Cloud) with Hadoop as a
Service (HaaS) is the current state-of-the-art.
Hadoop’s advantages far outweigh its challenges – it is ready to become the life
sciences analytics framework of the future. The EMC Isilon platform is bringing that
future to you today.

References
1. Pennisi, E; Science 11 February 2011: Vol. 331 no. 6018 pp. 666-668
2. Editorial, “Challenges and Opportunities”, Science 11 February 2011: Vol. 331 no.
6018 pp 692.
3. Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528
4. Cafarella, M and Cutting D, “Building Nutch, Open Source Search”, ACM Queue
vol. 2, no. 2, April 2004.
5. Dean J and Ghemawat S, "MapReduce: Simplified Data Processing on Large
Clusters", OSDI conference proceedings, 2004.
6. Vasiliu B, “Integrating BLAST with Sun GridEngine”, July 2003,
http://developers.sun.com/solaris/articles/integrating_blast.html, last visited
Dec 2011.
7. White, Tom: “Hadoop -- The Definitive Guide” 2nd Edition, Published by O’Reilly,
Oct 2010
8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011


9. MapReduce example:
http://markusklems.files.wordpress.com/2008/07/mapreduce.png , last visited
Dec 2011.
10. “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr 2009,
http://sortbenchmark.org/YahooHadoop.pdf,
http://sortbenchmark.org/Yahoo2009.pdf last accessed Dec 2011
11. McKenna A, et al, "The Genome Analysis Toolkit: A MapReduce framework for
analyzing next-generation DNA sequencing data", Genome Research, 20:1297–
1303, July 2010.
12. Langmead B, Schatz MC, et al, “Human SNPs from short reads in hours using
cloud computing” Poster Presentation, WABI Sep 2009,
http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf,
last accessed Dec 2011.
13. Taylor RC, "An overview of the Hadoop/MapReduce/HBase framework and its
current applications in bioinformatics" BMC Bioinformatics 2010, 11(Suppl
12):S1, http://www.biomedcentral.com/1471-2105/11/S12/S1 , last accessed
Dec 2011.
14. Ramakrishnan L, “Evaluating Cloud Computing for HPC Applications”, DoE
NERSC, http://www.nersc.gov/assets/Events/MagellanNERSCLunchTalk.pdf, last
accessed Dec 2011.
15. “DNAnexus to mirror SRA database in Google Cloud”, BioIT World, Page 41,
http://www.bio-itworld.com/uploadedFiles/Bio-
IT_World/1111BITW_download.pdf , last visited Dec 2011.
16. Altschul SF, et al, "Basic local alignment search tool". J Mol Biol 215 (3): 403–
410, October 1990.
17. Lockner J.,"EMC’s Enterprise Hadoop Solution: Isilon Scale-out NAS and
GreenPlum HD", White Paper, The Enterprise Strategy Group, Inc (ESG), February
2012

