White Paper: Hadoop in Life Sciences — An Introduction



White Paper

HADOOP IN THE LIFE SCIENCES: An Introduction

Abstract

This introductory white paper reviews the Apache Hadoop™ technology, its components – MapReduce and the Hadoop Distributed File System (HDFS) – and its adoption in the life sciences, with an example in genomics data analysis.

March 2012
Copyright © 2012 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

Part number h10574
Table of Contents

  • Audience
  • Executive Summary
  • Hadoop: an Introduction
  • Genomics example: CrossBow
  • Enterprise-Class Hadoop on EMC Isilon
  • Conclusion
  • References

Audience

This white paper introduces the new data processing and analysis paradigm, Hadoop™, within the context of its usage in the life sciences, specifically genomics sequencing. It is intended for audiences with basic knowledge of storage and computing technology and a rudimentary understanding of DNA sequencing and the bioinformatics analysis associated with it.
Executive Summary

Life sciences data will soon reach the ExaByte (10^18 bytes, EB) scale. This is “Big Data”. As a reference point, all words ever spoken by all human beings, when transcribed, amount to about 5 EB of data. A recent article titled “Will Computers Crash Genomics?”1 points to exponential growth of the total genomics sequencing market capacity, as outlined in Figure 1 below: 10 Tera base pairs (10^12 bp) per day, with an astounding 5x year-on-year growth rate (500%). The human genome is approximately 3 billion base pairs long; a base pair (bp) comprises DNA bases in G-C or A-T pairs.

Figure 1: Genomics Growth

Each base pair represents a total of about 100 bytes (of raw, analyzed and interpreted data). Therefore the genomics market capacity in 2010, in storage terms (from Fig. 1), was about 200 PetaBytes (PB), with the capacity growing to about 1 ExaByte (EB) by late 2012. This capacity is drowning out technologies attempting to handle the deluge of Big Data in the life sciences; proteomics (the study of proteins) and imaging data are in the early stages of a similar exponential rise. It is not just the data storage volume, but also its velocity and variability, that make this a challenge requiring “scale-out” technologies: ones that grow simply and painlessly as data center and business needs grow. Within the past year, one computing and storage framework has matured into a contender to handle this tsunami of Big Data: Hadoop™.

Life sciences workflows require a High Performance Computing (HPC) infrastructure to process and analyze the data to determine the variations in the genome, and the proper scale of storage to retain this data. With Next Generation Sequencing (NGS) workflows generating up to 2 TeraBytes (TB) of data per run per week per sequencer – not including the raw images – the need for scale-out storage that integrates easily with HPC is a “line item requirement”.
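The storage arithmetic above can be checked with a quick back-of-the-envelope sketch. The 2×10^15 base-pair figure for 2010 used below is inferred from the ~200 PB capacity the paper quotes at 100 bytes per base pair; it is not stated directly in Figure 1:

```python
# Back-of-the-envelope check of the storage figures quoted above.
BYTES_PER_BP = 100        # raw + analyzed + interpreted data per base pair
PB, EB = 10**15, 10**18

genome_bp = 3 * 10**9                      # one human genome, ~3 billion bp
print(genome_bp * BYTES_PER_BP / 10**9)    # ~300 GB per genome

# ~2 Peta base pairs sequenced in 2010 (inferred from the 200 PB figure)
capacity_2010 = 2 * 10**15 * BYTES_PER_BP
print(capacity_2010 / PB)                  # ~200 PB

# At 5x year-on-year growth, a single growth step already reaches the
# ~1 EB the paper projects for late 2012.
print(capacity_2010 * 5 / EB)              # ~1 EB
```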
EMC Isilon has provided the scale-out storage for nearly all the workflows of all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS storage platform has built a life sciences installed base of more than 65 PetaBytes (PB).
As genomics has very large, semi-structured, file-based data and is modeled on post-process streaming data access and I/O patterns that can be parallelized, it is ideally suited for Hadoop. Hadoop consists of two main components: a file system and a compute system – the Hadoop Distributed File System (HDFS) and the MapReduce framework, respectively. The Hadoop ecosystem consists of many open source tools, as shown in Figure 2 below:

Figure 2: Hadoop Components

To make Hadoop storage “scale-out” and truly distributed, the EMC Isilon OneFS™ file system features connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB3. This allows for co-location of the data with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce “jobs”.

Hadoop: an Introduction

Hadoop was created by Doug Cutting of the Apache Lucene project4, initially as the Nutch Distributed File System (NDFS), inspired in 2004 by Google’s GFS data infrastructure and the MapReduce5 application layer. Hadoop is an Apache™ Foundation project comprising a MapReduce layer for data analysis and a Hadoop Distributed File System (HDFS) layer, written in the Java programming language, to distribute and scale the MapReduce data.

The Hadoop MapReduce framework runs on the compute cluster using the data stored on HDFS. MapReduce jobs provide key/value-based processing in a highly parallelized fashion. Since the data is distributed over the cluster, a MapReduce job can be split up to run many parallel processes over the data stored on the cluster. The Map parts of MapReduce run only on the data they can see – that is, the data blocks on the particular machine each one is running on. The Reduce brings together the output from the Maps. The result is a system that provides a highly parallel batch processing capability. The system scales well: you simply add more hardware to increase its storage capacity or decrease the time a MapReduce job takes to run.

The partitioning of the storage and compute framework into master and worker node types is outlined in Figure 3 below:

Figure 3: Hadoop Cluster

Hadoop is a Write Once Read Many (WORM) system with no random writes. This makes Hadoop faster than HPC and storage integrated separately. The life sciences have been at the forefront of the technology adoption curve: one of the earliest use-cases of the Sun GridEngine6 HPC was the DNA sequence comparison BLAST16 search. Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV7. The R (statistical language) Hadoop interface, RHIPE8, is also popular in the life sciences community.

The HDFS layer has a “Name Node”, the controller, which provides “data locality”, and uses a “share nothing” architecture – a scheme based on distributed, independent nodes7.

From a platform perspective, the OneFS HDFS interface is compatible with Apache Hadoop, EMC GreenPlum3 and Cloudera. In a traditional Hadoop implementation, the HDFS “Name Node” is a single point of failure, since it is the sole keeper of all the metadata for all the data that lives in the filesystem; the OneFS HDFS interface resolves this by distributing the name node data3. HDFS creates 3x replicas for redundancy; OneFS drastically reduces the need for the 3x copy.

A good example of the MapReduce “key-value” pair process – counting specific words across documents9 – is shown in Figure 4 below:
Figure 4: Hadoop Example – word count across documents

Hadoop is not suited for low-latency, “in-process” use-cases like real-time, spectral or video analysis, or for large numbers of small files (<8 KB). When small files must be used, the Hadoop Archive (HAR) can bundle them for processing.

Life sciences organizations have been among Hadoop’s earliest adopters. Following the publication of the first Apache Hadoop project10 in January 2008, the first large-scale MapReduce project was initiated by the Broad Institute, resulting in the comprehensive Genome Analysis Tool Kit (GATK)11. The Hadoop “CrossBow” project12 from Johns Hopkins University came soon after. Other projects are Cloud-based: they include CloudBurst, Contrail, Myrna and CloudBLAST13. An interesting implementation is the NERSC (Department of Energy) Flash-based Hadoop cluster within the Magellan Science Cloud14.
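The word-count flow of Figure 4 can be sketched in miniature. This is a plain-Python illustration of the key/value model only, not the Hadoop Java API (where Mapper and Reducer classes would play these roles, and the framework would perform the sort/group step across the cluster):

```python
from itertools import groupby

def map_phase(document):
    # Each Map task sees only its local block of data and emits
    # (key, value) pairs -- here, (word, 1) for every word it reads.
    for word in document.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # The framework sorts and groups the pairs by key; the Reduce task
    # then combines each group -- here, by summing the counts.
    counts = {}
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        counts[word] = sum(count for _, count in group)
    return counts

# Two "documents", as if stored in separate HDFS blocks.
documents = ["the cat sat", "the dog sat"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(pairs))  # {'cat': 1, 'dog': 1, 'sat': 2, 'the': 2}
```

Because each mapper touches only its own block, adding machines adds both storage and Map parallelism, which is the scaling property described above.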
Genomics example: CrossBow

The Hadoop ‘word count across documents’ example in Fig. 4 can be extended to DNA sequencing: counting single-base changes across millions of short DNA fragments and across hundreds of samples. A Single Nucleotide Polymorphism (SNP) occurs when one nucleotide (A, T, C or G) varies in the DNA sequence between members of the same biological species. Next Generation Sequencers (NGS) like the Illumina® HiSeq can produce on the order of 200 Giga base pairs of data in a single one-week run for a 60x human genome “coverage” – meaning that each base was present, on average, in 60 reads. The larger the coverage, the more statistically significant the result. This data requires specialized software algorithms called “short read aligners”.

CrossBow12 is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS. Figure 5 explains the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited for a Hadoop framework, and the cluster shown in Figure 5 is a traditional N-node Hadoop cluster.

Figure 5: CrossBow example – SNP calls across DNA fragments

1. The Map step is the short read alignment algorithm, Bowtie (based on the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads and the output tuples are the alignments of the short reads.

2. The Sort step apportions the alignments according to a primary key (the genome partition) and sorts them on a secondary key (the offset within that partition). The data here are the sorted alignments.

3. The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the Hadoop cluster. Input tuples are the sorted alignments for a partition and the output tuples are SNP calls.

Results are stored via HDFS and then archived in SOAPsnp format.

Enterprise-Class Hadoop on EMC Isilon

As demonstrated by the previous examples, the data and analysis scalability required for genomics is ideally suited for Hadoop. EMC Isilon’s OneFS distributes the Hadoop Name Node to provide high availability and load balancing, thereby eliminating the single point of failure. The Isilon NAS storage solution provides a highly efficient single file system/single volume, scalable up to 15 PB. Data can be staged from other protocols to HDFS using OneFS as a staging gateway. EMC Isilon provides Enterprise-grade data services to the Hadoop infrastructure via SnapshotIQ and SyncIQ, for advanced backup and disaster recovery capabilities.

The equation for Hadoop scalability can be represented as:

  Big (Data + Analytics) = Hadoop on EMC Isilon

These advantages are summarized in Figure 6 below:

Figure 6: Hadoop advantages with EMC Isilon

When combined with the EMC GreenPlum Analytics appliance and solution17, the Hadoop architecture becomes a complete Enterprise package.
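The CrossBow Map-Sort-Reduce pipeline described above can be sketched in miniature. This toy stand-in replaces Bowtie alignment and SOAPsnp calling with trivial logic (reads are pre-aligned single-base observations, and the SNP call is a simple majority vote against the reference); the names `reference`, `map_align` and `reduce_snp` are illustrative, not part of CrossBow:

```python
from collections import Counter
from itertools import groupby

# Toy reference genome split into numbered partitions, and "reads" that
# are single-base observations: (partition, offset, observed_base).
reference = {0: "ACGT", 1: "TTGA"}

def map_align(read):
    # Map step stand-in: emit an alignment keyed by (partition, offset).
    partition, offset, base = read
    return ((partition, offset), base)

def reduce_snp(aligned):
    # Sort step: order alignments by partition, then offset.
    # Reduce step stand-in: call a SNP wherever the consensus base
    # differs from the reference base at that position.
    snps = []
    for (part, off), group in groupby(sorted(aligned), key=lambda kv: kv[0]):
        consensus, _ = Counter(b for _, b in group).most_common(1)[0]
        if consensus != reference[part][off]:
            snps.append((part, off, consensus))
    return snps

reads = [(0, 1, "G"), (0, 1, "G"), (0, 1, "C"), (1, 3, "A")]
aligned = [map_align(r) for r in reads]
print(reduce_snp(aligned))  # [(0, 1, 'G')]
```

In the real pipeline, many Bowtie mappers and SOAPsnp reducers run in parallel, with Hadoop performing the sort/partition step between them across the cluster.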
Conclusion

What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited to the life sciences: parallelization and distribution. The post-processing streaming data patterns for text strings, clustering and sorting – the core process patterns in the life sciences – are ideal workflows for Hadoop. The CrossBow example discussed above aligned Illumina NGS reads for SNP calling over a ‘35x’ coverage of the human genome in under 3 hours using a 40-node Hadoop cluster – an order of magnitude better than traditional HPC technology for parallel processes.

Even though Hadoop implementations on Public Cloud instances are popular, several issues have led most large institutions to maintain their own data repositories internally: large data transfers from on-premise storage to the Cloud; data regulations and security; data availability; data redundancy; and HPC throughput. This is especially true as genome sequencing moves into the clinic for diagnostic testing.

The convergence of these issues is evidenced by the mirroring of the Sequence Read Archive (SRA) of the National Center for Biotechnology Information (NCBI) on DNAnexus’ SRA Cloud15 – a business model slowly evolving into a ‘full data and analysis offsite’ model via Hadoop. The Hybrid Cloud model (a source data mirror between Private Cloud and Community Cloud) with Hadoop as a Service (HaaS) is the current state of the art.

Hadoop’s advantages far outweigh its challenges – it is ready to become the life sciences analytics framework of the future. The EMC Isilon platform is bringing that future to you today.

References

1. Pennisi, E, Science, 11 February 2011: Vol. 331, no. 6018, pp. 666-668.
2. Editorial, “Challenges and Opportunities”, Science, 11 February 2011: Vol. 331, no. 6018, p. 692.
3. “Hadoop on EMC Isilon Scale-Out NAS”, EMC White Paper, Part Number h10528.
4. Cafarella, M and Cutting, D, “Building Nutch: Open Source Search”, ACM Queue, vol. 2, no. 2, April 2004.
5. Dean, J and Ghemawat, S, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI conference proceedings, 2004.
6. Vasiliu, B, “Integrating BLAST with Sun GridEngine”, July 2003, http://developers.sun.com/solaris/articles/integrating_blast.html, last visited Dec 2011.
7. White, Tom, “Hadoop: The Definitive Guide”, 2nd Edition, O’Reilly, Oct 2010.
8. RHIPE: http://ml.stat.purdue.edu/rhipe/, last visited Dec 2011.
9. MapReduce example: http://markusklems.files.wordpress.com/2008/07/mapreduce.png, last visited Dec 2011.
10. “Hadoop wins Terabyte sort benchmark”, Apr 2008 and Apr 2009, http://sortbenchmark.org/YahooHadoop.pdf, http://sortbenchmark.org/Yahoo2009.pdf, last accessed Dec 2011.
11. McKenna, A, et al, “The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data”, Genome Research, 20:1297-1303, July 2010.
12. Langmead, B, Schatz, MC, et al, “Human SNPs from short reads in hours using cloud computing”, Poster Presentation, WABI, Sep 2009, http://www.cbcb.umd.edu/~mschatz/Posters/Crossbow_WABI_Sept2009.pdf, last accessed Dec 2011.
13. Taylor, RC, “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics”, BMC Bioinformatics, 2010, 11(Suppl 12):S1, http://www.biomedcentral.com/1471-2105/11/S12/S1, last accessed Dec 2011.
14. Ramakrishnan, L, “Evaluating Cloud Computing for HPC Applications”, DoE NERSC, http://www.nersc.gov/assets/Events/MagellanNERSCLunchTalk.pdf, last accessed Dec 2011.
15. “DNAnexus to mirror SRA database in Google Cloud”, BioIT World, p. 41, http://www.bio-itworld.com/uploadedFiles/Bio-IT_World/1111BITW_download.pdf, last visited Dec 2011.
16. Altschul, SF, et al, “Basic local alignment search tool”, J Mol Biol, 215(3): 403-410, October 1990.
17. Lockner, J, “EMC’s Enterprise Hadoop Solution: Isilon Scale-out NAS and GreenPlum HD”, White Paper, The Enterprise Strategy Group, Inc (ESG), February 2012.