Whitepaper : CHI: Hadoop's Rise in Life Sciences



Large, semi-structured, file-based genomics data is ideally suited to a Hadoop Distributed File System. The EMC Isilon OneFS file system features connectivity to the Hadoop Distributed File System (HDFS) that makes Hadoop storage "scale-out" and truly distributed. An example from the "Crossbow" project is explored.

Published in: Technology

  1. Exploring EMC Isilon scale-out storage solutions. Hadoop’s Rise in Life Sciences. By John Russell, Contributing Editor, Bio•IT World. Produced by Cambridge Healthtech Media Group.
  2. By now the ‘Big Data’ challenge is familiar to the entire life sciences community. Modern high-throughput experimental technologies generate vast data sets that can only be tackled with high performance computing (HPC). Genomics, of course, is the leading example. At the end of 2011, global annual sequencing capacity was estimated at 13 quadrillion bases and growing rapidly [1]. It’s worth noting that a single base pair typically represents about 100 bytes of data (raw, analyzed, and interpreted).

     The need to manage and analyze these massive data sets, not just in life sciences but throughout all of science and industry, has spurred many new approaches to HPC infrastructure and led to many important IT advances, particularly in distributed computing. While there isn’t a single right answer, one approach, the Hadoop storage and compute framework, is emerging as a compelling contender for use in life sciences to cope with the deluge of data.

     Created in 2004 by Doug Cutting (who famously named it after his son’s stuffed elephant) and elevated to a top-level Apache Foundation project in 2008, Hadoop is intended to run large-scale distributed data analysis on commodity clusters. Cutting was initially inspired by a paper [2] from Google Labs describing Google’s BigTable infrastructure and MapReduce application layers. (For a detailed perspective, see Ronald Taylor’s “An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics” [3].)

     Broadly, Hadoop uses a file system (the Hadoop Distributed File System, HDFS) and framework software (MapReduce) to break extremely large data sets into chunks, to distribute/store (Map) those chunks to nodes in a cluster, and to gather (Reduce) results following computation. Hadoop’s distinguishing feature is that it automatically stores the chunks of data on the same nodes on which they will be processed. This strategy of co-locating data and processing power (proximity computing) significantly accelerates performance. In April 2008 a Hadoop program running on a 910-node cluster broke a world record, sorting a terabyte of data in less than 3.5 minutes [4].

     [1] “DNA Sequencing Caught in Deluge of Data”, New York Times, Nov. 30, 2011, http://www.nytimes.com/2011/12/01/business/dna-sequencing-caught-in-deluge-of-data.html?_r=1&ref=science
     [2] OSDI’04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004, http://research.google.com/archive/mapreduce.html
     [3] An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/
     [4] “Hadoop wins Terabyte sort benchmark”, Apr 2008, Apr. 2009, http://sortbenchmark.org/YahooHadoop.pdf, last accessed Dec 2011
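The chunk, Map and Reduce flow described above can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions, not Hadoop code: the function names, the chunk size and the base-counting task are all hypothetical, and a real cluster would run the map calls on the nodes that hold each chunk.

```python
from collections import defaultdict
from itertools import chain


def split_into_chunks(data, chunk_size):
    """Break a large input into fixed-size chunks (a stand-in for HDFS blocks)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def map_chunk(chunk):
    """Map: emit a (key, value) pair for every base in one chunk."""
    return [(base, 1) for base in chunk]


def shuffle(pairs):
    """Group values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce: gather the values collected for one key into a single result."""
    return key, sum(values)


def count_bases(sequence, chunk_size=4):
    """Run the whole toy pipeline: chunk, map, shuffle, reduce."""
    chunks = split_into_chunks(sequence, chunk_size)
    mapped = chain.from_iterable(map_chunk(c) for c in chunks)
    grouped = shuffle(mapped)
    return dict(reduce_group(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(count_bases("ACGTACGGTTAC"))  # {'A': 3, 'C': 3, 'G': 3, 'T': 3}
```

Because each call to `map_chunk` sees only its own chunk, the chunks could be processed on different machines in any order, which is the property the co-location strategy relies on.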
  3. Part of the improved performance stems from MapReduce’s key:value programming model, which speeds up and scales up parallelized “job” execution better than many alternatives such as the GridEngine architecture for High Performance Computing (HPC). (One of the earliest use cases of the Sun GridEngine [5] HPC was the DNA sequence comparison BLAST search.) The MapReduce layer is a batch query processor with dynamic data schema and linear scaling for unstructured or semi-structured data. Its data is not “normalized” (decomposed into smaller structured relationships). Therefore, higher-level interpreted programming languages like Ruby and Python, and a compiled language like C++, provide easier access to MapReduce to represent the program as MapReduce “jobs”.

     Standard Hadoop interfaces are available via Java, C, FUSE and WebDAV. The Hadoop R (statistical language) interface, RHIPE, is also popular in the life sciences community.

     It turns out that Hadoop, a fault-tolerant, shared-nothing architecture in which tasks must have no dependence on each other, is an excellent choice for many life sciences applications. This is largely because so much of life sciences data is semi-structured or unstructured file-based data, ideally suited to ‘embarrassingly parallel’ computation. Moreover, the use of commodity hardware (e.g. a Linux cluster) keeps cost down, and little or no hardware modification is required [6].

     Not surprisingly, life sciences organizations were among Hadoop’s earliest adopters. The first large-scale MapReduce project was initiated by the Broad Institute (in 2008) and resulted in the comprehensive Genome Analysis Tool Kit (GATK) [7]. The Hadoop “Crossbow” project from Johns Hopkins University came soon after [8].

     [5] Altschul SF, et al, “Basic local alignment search tool”, J Mol Biol 215 (3): 403–410, October 1990.
     [6] An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3040523/
     [7] McKenna A, et al, “The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data”, Genome Research, 20:1297–1303, July 2010.
     [8] http://bowtie-bio.sourceforge.net/crossbow/index.shtml
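To make the key:value model concrete, here is a minimal sketch of a mapper and reducer written in Python in the style used by Hadoop Streaming, where records are exchanged as tab-separated key and value text and the reducer receives its input sorted by key. The k-mer counting task, the constant `K` and the `run_local` helper are assumptions made for illustration; they do not come from the whitepaper.

```python
from itertools import groupby

K = 3  # hypothetical k-mer length, chosen only for this illustration


def mapper(lines):
    """Emit one 'kmer<TAB>1' record per k-mer; each input line is handled independently."""
    for line in lines:
        read = line.strip().upper()
        for i in range(len(read) - K + 1):
            yield f"{read[i:i + K]}\t1"


def reducer(sorted_records):
    """Sum the values for each key; records must arrive sorted by key."""
    parsed = (record.split("\t") for record in sorted_records)
    for kmer, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{kmer}\t{sum(int(value) for _, value in group)}"


def run_local(lines):
    """Local stand-in for the map, sort and reduce phases of a job."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_local(["ACGTAC", "GTACG"]):
        print(record)
```

No mapper call depends on any other, so the tasks satisfy the "no dependence on each other" requirement and can be spread across nodes freely.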
  4. Here are a few current Hadoop-based bioinformatics applications [9]:
  • Crossbow. Whole genome resequencing analysis; SNP genotyping from short reads.
  • Contrail. De novo assembly from short sequencing reads.
  • Myrna. Ultrafast short read alignment and differential gene expression from large RNA-seq data sets.
  • PeakRanger. Cloud-enabled peak caller for ChIP-seq data.
  • Quake. Quality-aware detection and sequencing error correction tool.
  • BlastReduce. High-performance short read mapping.
  • CloudBLAST. Hadoop implementation of NCBI’s BLAST.
  • MrsRF. Algorithm for analyzing large evolutionary trees.

     (For a more detailed example of Hadoop in operation, see the sidebar, Genomics Example: Calling SNPs with Crossbow.)

     Genomics Example: Calling SNPs with Crossbow

     Next Generation Sequencers (NGS) like the Illumina HiSeq can produce data on the order of 200 billion base pairs (200 Gbp) in a single one-week run for 60x human genome coverage, which means that each base was present on an average of 60 reads. The larger the coverage, the more statistically significant the result. Sequence reads are much shorter than those from traditional “Sanger” sequencing. This data requires specialized software algorithms called “short read aligners”. Crossbow is a combination of several algorithms that provide SNP calling and short read alignment, which are common tasks in NGS.

     Figure 1 explains the steps necessary to process genome data to look for SNPs. The Map-Sort-Reduce process is ideally suited to a Hadoop framework. The cluster as shown is a traditional N-node Hadoop cluster. All of the Hadoop features like HDFS, program management and fault tolerance are available.

     The Map step is the short read alignment algorithm, called Bowtie (named after the Burrows-Wheeler Transform, BWT). Multiple instances of Bowtie are run in parallel in Hadoop. The input tuples (ordered lists of elements) are the sequence reads and the output tuples are the alignments of the short reads.

     The Sort step apportions the alignments according to a primary key (the genome partition) and sorts based on a secondary key (the offset within that partition). The data here are the sorted alignments.

     The Reduce step calls SNPs for each reference genome partition. Many parallel instances of the algorithm SOAPsnp (Short Oligonucleotide Analysis Package for SNP) run in the cluster. Input tuples are sorted alignments for a partition and the output tuples are SNP calls. Results are stored via HDFS, and then archived in SOAPsnp format.

     [9] Got Hadoop?, Sept. 2011, Genome Technology, http://www.genomeweb.com/informatics/got-hadoop
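The Map-Sort-Reduce steps of the sidebar can be illustrated with a toy pipeline. The sketch below is an assumption-laden stand-in, not the real tools: the aligner is a brute-force mismatch count rather than Bowtie, the SNP caller is a simple majority vote rather than SOAPsnp, and the reference, partition size and reads are invented. As a further simplification, the mapper emits one tuple per aligned base so that no read straddles two partitions.

```python
from collections import Counter
from itertools import groupby

REFERENCE = "GATTACAGCCTA"  # toy reference genome (hypothetical)
PARTITION_SIZE = 6          # hypothetical genome partition width


def map_align(read):
    """Map: place the read at the ungapped position with the fewest mismatches,
    then emit one (partition, offset, base) tuple per aligned base."""
    positions = range(len(REFERENCE) - len(read) + 1)
    start = min(positions, key=lambda p: sum(
        a != b for a, b in zip(read, REFERENCE[p:p + len(read)])))
    for i, base in enumerate(read):
        pos = start + i
        yield (pos // PARTITION_SIZE, pos % PARTITION_SIZE, base)


def sort_alignments(tuples):
    """Sort: primary key is the partition, secondary key is the offset within it."""
    return sorted(tuples, key=lambda t: (t[0], t[1]))


def reduce_call_snps(partition, tuples):
    """Reduce: report positions where the majority base differs from the reference."""
    calls = []
    for offset, group in groupby(tuples, key=lambda t: t[1]):
        consensus, _ = Counter(base for _, _, base in group).most_common(1)[0]
        pos = partition * PARTITION_SIZE + offset
        if consensus != REFERENCE[pos]:
            calls.append((pos, REFERENCE[pos], consensus))
    return calls


def call_snps(reads):
    """Run map, sort and one reduce per partition; return (position, ref, alt) calls."""
    mapped = [t for read in reads for t in map_align(read)]
    ordered = sort_alignments(mapped)
    snps = []
    for partition, group in groupby(ordered, key=lambda t: t[0]):
        snps.extend(reduce_call_snps(partition, list(group)))
    return snps


if __name__ == "__main__":
    reads = ["GATTAC", "ACAGT", "AGTC", "GTCTA"]  # three reads carry a C>T change
    print(call_snps(reads))  # [(8, 'C', 'T')]
```

Each partition is reduced independently of the others, which is why the reduce step can run as many parallel instances across the cluster.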
  5. After several years of steady development in academic environments, Hadoop is now poised for rapid commercialization and broader uptake in biopharma and healthcare. Early adoption has been strongest among next generation sequencing (NGS) centers, where NGS workflows can generate 2 terabytes (TB) of data per run per week per sequencer, not including the raw images. For these organizations, the need for scale-out storage that integrates with HPC is a line item requirement.

     EMC® Isilon®, long a leader in scale-out NAS storage solutions, understands these challenges and has provided the scale-out storage for nearly all the workflows for all the DNA sequencer instrument manufacturers in the market today, at more than 150 customers. Since 2008, the EMC Isilon OneFS® storage platform has built an overall installed base of more than 65 petabytes (PB). Recently, EMC introduced the industry’s first scale-out NAS system with native Hadoop support (via HDFS).

     The EMC Isilon OneFS file system now provides for connectivity to the Hadoop Distributed File System (HDFS) just like any other shared file system protocol: NFS, CIFS or SMB [10]. This allows for the co-location of the storage with its compute nodes, using the standard higher-level Java application programming interface (API) to build MapReduce “jobs”. EMC has gone one step further by combining its OneFS-based NAS solution with EMC Greenplum® HD, a powerful analytics platform, to create a Hadoop appliance. Together, the two offerings relieve users of the burden of cobbling together various open source Hadoop components, which sometimes proves problematic.

     “Hadoop meets all the tenets of Jim Gray’s Laws of Data Engineering [11], which have not changed in 15 years,” says Sanjay Joshi, CTO, Life Sciences, EMC Isilon Storage Division. Those tenets include: scientific computing is very data intensive, with no real limits; the solution is a scale-out architecture with distributed data access; and bring computation to the data, rather than data to the computations.

     [10] Hadoop on EMC Isilon Scale Out NAS: EMC White Paper, Part Number h10528
     [11] From Jim Gray, “Scalable Computing”, presentation at Nortel: Microsoft Research, April 1999
  6. “Isilon built the industry’s first Scale Out storage architecture. Now with its native and enterprise-ready HDFS protocol via OneFS and Greenplum HD, EMC brings simplicity to Big Data in Science,” says Joshi.

     EMC Isilon OneFS combines the three layers of traditional storage architectures (the file system, volume manager, and RAID) into one unified software layer, creating a single intelligent distributed file system that runs on one storage cluster. Important advantages of OneFS for Hadoop are:
  • Scalable: Linear scale with increasing capacity, from 18 TB to 16 PB in a single file system and a single global namespace. Scale out as needs grow, independent of the compute layer.
  • Predictable: Dynamic content balancing is performed as nodes are added or upgraded, or as capacity changes. No added management time is required since this process is simple.
  • Available: OneFS protects your data from power loss, node or disk failures, loss of quorum and storage rebuild by distributing data, metadata and parity across all nodes. It also eliminates the single point of failure of a Hadoop “Name Node”. Therefore OneFS is “self healing”.
  • Efficient: Compared to the average 50% efficiency of traditional RAID systems, OneFS provides over 80% efficiency, independent of CPU compute or cache. This efficiency is achieved by tiering the process into three node types, as shown in the accompanying figure, and by the pools within these node types. This efficiency extends to the reduction from the 3x copy that Hadoop requires to the >80% efficient 1x storage via EMC Isilon’s HDFS protocol.
  • Enterprise-ready: Administration of the storage clusters is via an intuitive Web-based UI. Connectivity to your process is through standard file protocols: CIFS, SMB, NFS, FTP/HTTP, iSCSI and HDFS. Standardized authentication and access control is available at scale: AD, LDAP and NIS.

     [Figure caption: Storage tiers without fears based on performance reside in one global namespace, connected via a dedicated backend network.]
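The efficiency comparison above comes down to simple arithmetic, sketched here. The 100 TB data set is a hypothetical figure; the 3x copy, the 50% RAID average and the 80% OneFS floor are the numbers quoted in the text, and the helper names are invented for the example.

```python
def raw_tb_for_replication(usable_tb, copies):
    """Raw capacity needed when every block is stored `copies` times."""
    return usable_tb * copies


def raw_tb_for_efficiency(usable_tb, efficiency_percent):
    """Raw capacity needed when only `efficiency_percent` of raw space holds data."""
    if not 0 < efficiency_percent <= 100:
        raise ValueError("efficiency_percent must be in (0, 100]")
    return usable_tb * 100 / efficiency_percent


if __name__ == "__main__":
    usable_tb = 100  # hypothetical size of the data set to store
    print("3x Hadoop copies:", raw_tb_for_replication(usable_tb, 3), "TB raw")
    print("RAID at 50%:", raw_tb_for_efficiency(usable_tb, 50), "TB raw")
    print("OneFS at 80%:", raw_tb_for_efficiency(usable_tb, 80), "TB raw")
```

Under these assumptions the same 100 TB needs 300 TB of raw disk with three copies, 200 TB at 50% efficiency, and 125 TB at 80% efficiency.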
  7. CONCLUSION

     What began as an internal project at Google in 2004 has now matured into a scalable framework for two computing paradigms that are particularly suited to the life sciences: parallelization and distribution. Indeed, the post-processing streaming data patterns for text strings, clustering and sorting (the core process patterns in the life sciences) are ideal workflows for Hadoop.

     Case in point: the Crossbow example cited earlier aligned Illumina NGS reads for SNP calling over ‘35x’ coverage of the human genome in under 3 hours using a 40-node Hadoop cluster, an order of magnitude better than traditional HPC technology for parallel processes.

     The EMC Isilon OneFS distributed file system handles the Hadoop distributed file system, HDFS, just like any other shared file system, and provides a shield for the single point of failure in Hadoop: the name node. The Hybrid Cloud model (source data mirror) with Hadoop as a Service (HaaS) is the current state of the art. For more information visit EMC Isilon at http://www.emc.com/isilon.

     Summary of Hadoop Attributes:

     Overview
  • Write Once Read Many times (WORM)
  • Co-locates data with compute, uses higher level architecture with Java API
  • HDFS is a distributed file system that runs on large clusters

     Advantages
  • Uses MapReduce framework, a batch query processor that scales linearly
  • EMC Isilon OneFS implements HDFS and eliminates the single point of failure, the “name node”
  • Standard programming language development: Java, Ruby, Python and C++ create MapReduce jobs. FUSE and WebDAV interfaces provide architectural flexibility

     Challenges
  • HDFS block size is 128 MB (can be increased), therefore large numbers of small files (<8 KB) reduce its performance: use Hadoop Archive (HAR)
  • Data coherency and latency remain issues for large scale implementations
  • Not suited for low-latency, “in process” use cases like real-time, spectral or video analysis
  • Data transfer from genome sequencing data sources to the Hadoop clusters in the Cloud remains an issue; the current business model is to mirror the data between source and Cloud and then use a Hadoop as a Service model on the mirrored data
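The small-file challenge listed above can be put in numbers with a short calculation. The sketch assumes, as is commonly described for HDFS but not stated in the whitepaper, that every non-empty file needs at least one block entry of its own in the name node's metadata; the 1 TiB data set and the helper names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024**2  # 128 MB block size, as quoted in the summary


def ceil_div(a, b):
    """Integer ceiling division that avoids floating point."""
    return -(-a // b)


def blocks_for_file(file_size, block_size=BLOCK_SIZE):
    """Assumption: a non-empty file occupies at least one block entry, however small."""
    return max(1, ceil_div(file_size, block_size))


def metadata_entries(total_bytes, file_size, block_size=BLOCK_SIZE):
    """Return (number of files, number of block entries) for a data set of equal-sized files."""
    n_files = ceil_div(total_bytes, file_size)
    return n_files, n_files * blocks_for_file(file_size, block_size)


if __name__ == "__main__":
    one_tib = 1024**4  # hypothetical data set size
    print("8 KB files:", metadata_entries(one_tib, 8 * 1024))
    print("1 GB files:", metadata_entries(one_tib, 1024**3))
```

For the same terabyte, 8 KB files produce 134,217,728 files and as many block entries, while 1 GB files produce 1,024 files and 8,192 block entries. Packing small files into an archive, as the HAR suggestion implies, moves the data set toward the second case.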