Ronald Taylor, Ph.D.
Computational Biology & Bioinformatics Group, Computational Sciences & Mathematics Division
Pacific Northwest National Laboratory (PNNL), Richland, Washington
Email: [email_address]
An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics
BOSC 2010 – July 9, 2010
General background - cloud computing
High performance computing (HPC) - typically, involves distribution of work across a cluster of machines which access a shared filesystem, hosted on a storage area network. Work parallelization implemented via such APIs as the Message Passing Interface (MPI) and, more recently, Hadoop’s MapReduce API.
Cloud computing = HPC + web interface + ability to rapidly scale up and down for on-demand use. Implemented in data centers operating on clusters, possibly on massive data sets. Large data sets are where Hadoop/HBase and other NoSQL databases become useful.
Hadoop is a Java software framework under a free license. (Use of other languages, such as Python, is possible through HadoopStreaming.) Initial version – 2004. It became a top-level Apache project in Jan 2008. Yahoo has been the largest contributor. Created by Doug Cutting, named after his son’s stuffed elephant.
Hadoop is designed to handle very large data sets, scaling into the petabyte range. It is installed on a single cluster, with MapReduce-type programs operating in parallel on data spread across the nodes of the cluster.
Web site: http://hadoop.apache.org/
Inspired by Google’s 2004 MapReduce and 2003 Google File System papers.
Greatly simplifies the development of large-scale, fault-tolerant, distributed apps on clusters of commodity machines for tasks fitting the MapReduce paradigm.
Components: the Hadoop Distributed File System (HDFS) and the MapReduce Java API for writing parallelized programs, as well as the Job and Task Trackers that keep track of the programs’ execution across the nodes of the cluster.
Data locality – Hadoop tries to automatically colocate the data with the computing node. Prime reason for Hadoop’s good performance. April 2008 – Hadoop on 910-node cluster broke world record, sorting terabyte of data in under 3.5 minutes.
Fault-tolerant, shared-nothing architecture. Tasks have no dependence on each other (with exception of mappers feeding into reducers, under Hadoop control). Hadoop can detect task failure and automatically restart programs on other healthy nodes.
Reliability – data is replicated across multiple nodes; does not require RAID storage. The single point of failure (SPOF) for the HDFS file system is the name node.
Unlike, say, MPI programming, data flow is implicit and handled automatically; does not need coding.
Hadoop (4) – MapReduce
Program execution is divided into a Map and a Reduce stage, separated by data transfer between nodes in the cluster.
Input -> Map() -> Copy/Sort -> Reduce() -> Output
In the first stage, a node executes a Map function on a section of the input data. Output is a set of records in the form of key-value pairs, stored on that node.
The records for any given key – possibly spread across many nodes – are aggregated at the node running the Reducer for that key. This involves data transfer between machines. This second stage is blocked from progressing until all the data from the Map stage has been transferred to the appropriate machine. The Reduce stage produces another set of key-value pairs, as final output.
Simple programming model, restricted to use of key-value pairs, but a surprising number of tasks / algorithms can be fit into this framework.
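The Map -> Copy/Sort -> Reduce flow above can be sketched in a few lines of plain Python. This is a single-process toy, not the Hadoop API: the function names (`map_kmers`, `shuffle`, `reduce_counts`, `run_job`) and the k-mer counting task are assumptions chosen for illustration.

```python
from collections import defaultdict

def map_kmers(read, k=3):
    """Map stage: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1

def shuffle(pairs):
    """Copy/Sort stage: gather all values that share a key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_counts(key, values):
    """Reduce stage: collapse the values for one key into a single record."""
    return key, sum(values)

def run_job(reads, k=3):
    # All map output is collected before the reduce stage starts,
    # mirroring the blocking behaviour described above.
    mapped = [pair for read in reads for pair in map_kmers(read, k)]
    return dict(reduce_counts(key, values) for key, values in shuffle(mapped))

counts = run_job(["ACGTAC", "GTACGT"])
print(counts)
```

On a real cluster the map calls would run on the nodes holding the input blocks and the shuffle would move data between machines; here everything happens in one process.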
Current stable release of Hadoop Common: 0.20.2 (Feb 2010)
While primarily used for batch analysis of very large data sets, nothing precludes use of Hadoop for high-CPU analyses. (See Mahout project.)
HDFS file system drawbacks – handles continuous updates (write many) less well than an RDBMS. Cannot be directly mounted onto existing operating system. Hence getting data into and out of the HDFS file system can be awkward.
HBase
Project started towards end of 2006 by Chad Walters & Jim Kellerman at Powerset.
Has its own Java Client API.
Can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat.
No HBase single point of failure. Uses Zookeeper for management of partial failures.
Data is stored in tables made up of rows and columns.
Each row has a row key and one or more column families. Each column family can have thousands of columns (fields).
The data in each cell at a given row, col is versioned. Versions, by default, are differentiated by an auto-assigned timestamp. Hence: in a sense, three-dim db.
(Table, Row, Family:Column, Timestamp) -> Value
All table accesses are by the primary key. (Secondary indices are possible through additional index tables. Programmers need to denormalize and replicate.)
No SQL query language.
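The data model described above (row key, column families, timestamped cell versions) can be illustrated with a small in-memory stand-in. This is a sketch only: `ToyTable` and its `put`/`get` methods are hypothetical names, not the HBase client API, and the logical counter used for auto-assigned timestamps is an assumption.

```python
import itertools

class ToyTable:
    """In-memory stand-in for one table; not the HBase client API."""

    def __init__(self, families):
        self.families = set(families)     # the schema fixes column families only
        self.cells = {}                   # (row, "family:column") -> {timestamp: value}
        self._clock = itertools.count(1)  # hypothetical logical clock for timestamps

    def put(self, row, column, value, timestamp=None):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if timestamp is None:
            timestamp = next(self._clock)  # auto-assigned version
        self.cells.setdefault((row, column), {})[timestamp] = value
        return timestamp

    def get(self, row, column, timestamp=None):
        versions = self.cells.get((row, column), {})
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]  # newest version by default
        return versions.get(timestamp)

table = ToyTable(families=["expr", "meta"])
t1 = table.put("gene0001", "expr:sample_a", 1.5)
t2 = table.put("gene0001", "expr:sample_a", 2.5)   # second version of the same cell
table.put("gene0001", "meta:symbol", "abcA")       # new column, no schema change
latest = table.get("gene0001", "expr:sample_a")
first = table.get("gene0001", "expr:sample_a", timestamp=t1)
```

Every access goes through the row key, which is why secondary lookups need an extra index table keyed on the other attribute.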
A table is made up of regions. Each region is defined by a startKey and endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop.
Columns can be added on-the-fly. Schema only defines column families. Each cell is tagged by column family and column name, so programs always know what type of data item the cell contains.
In addition to being able to scale to petabyte size data sets, note ease of integration of disparate data sources into a small number of HBase tables for building a data workspace. Such facility is also important.
Web site: http://hbase.apache.org/
Current releases: 0.20.5 and (for developers) 0.89.20100621
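Because each region of an HBase table covers a contiguous range of row keys, finding the region for a row is a sorted-range lookup. The sketch below assumes regions are start-inclusive and end-exclusive; the start keys and host names are made up for illustration.

```python
import bisect

# Hypothetical layout: each region begins at its start key and runs up to
# (but not including) the next start key. "" means "from the first row".
REGION_START_KEYS = ["", "gene2000", "gene4000", "gene6000"]
REGION_HOSTS = ["node-a", "node-b", "node-c", "node-a"]

def region_for(row_key):
    """Return (region index, host) for the region whose key range holds row_key."""
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return index, REGION_HOSTS[index]

print(region_for("gene0001"))
print(region_for("gene4500"))
```

Row keys that sort close together land in the same region, so key design decides how evenly load spreads across nodes.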
Pig
An Apache Hadoop subproject providing a high-level data-flow language (Pig Latin), along with an execution environment to run such Pig Latin programs.
Designed for batch processing of data.
Infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs.
Java client-side application. Install locally – nothing to alter on the cluster. Grunt: the Pig interactive shell.
Web site: http://hadoop.apache.org/pig/
Current release: 0.7.0
Hive
Data warehouse infrastructure built on top of Hadoop, on the cluster.
Developed at Facebook.
Users define tables and columns. Data is loaded into and retrieved through these tables.
Hive QL, a SQL-like query language, used in conjunction with MapReduce to create summaries, reports, analyses. Hive queries launch MapReduce jobs.
Designed for batch processing, not online transaction processing – unlike HBase, does not offer real-time queries.
Web page: http://hadoop.apache.org/hive/
Current release: 0.5.0
Cascading
Thin, open source Java library that sits on top of the Hadoop MapReduce layer.
Cascading is a query processing API that allows programmers to operate at a higher level than MapReduce, and more quickly assemble complex distributed processes, and schedule them based on dependencies.
Higher level abstractions are added (Functions, Filters, Aggregators, Buffers), and MapReduce “keys” and “values” are replaced by simple field names and a data tuple model (tuple=list of values)
First public release (0.1) in January 2008.
Web site: http://www.cascading.org/
Current release: 1.1.0
NoSQL, non-Hadoop alternatives for scalability in distributed environments
Hypertable ( http://hypertable.org/ ). Another BigTable implementation, written in C++. Open source. Current release: 0.9.3.3
Cassandra ( http://cassandra.apache.org ) – Apache open source distributed db management system. Developed by Facebook, following BigTable model. Current rel: 0.6.2
Others: Project Voldemort, Dynamo (used for Amazon S3), Tokyo Tyrant. Also: CouchDB and MongoDB, representative of JSON class of document database.
Also: HadoopDB (hybrid of MapReduce and DBMS) – see http://db.cs.yale.edu/hadoopdb/hadoopdb.html. PostgreSQL for db layer (one PostgreSQL instance per data chunk per node), Hadoop for communication layer, and extended version of Hive for translation layer.
Amazon Elastic Compute Cloud (EC2)
Web service that provides resizable compute capacity in the cloud.
Among other batch processing software, it provides Hadoop.
Web site: http://aws.amazon.com/ec2/
Deepak Singh of Amazon Web Services is very interested in bioinformatics algorithms (Hadoop-based and otherwise) running in the cloud – see his talk at Hadoop World NY Oct 2009 at http://vimeo.com/7351342
Michael Schatz (U of MD) has tested Crossbow on EC2 and believes that running such on EC2 is quite cost effective – see http://sourceforge.net/apps/mediawiki/cloudburst-bio/index.php?title=Hadoop_for_Computational_Biology
NoSQL db testing and benchmarking
Brian Cooper and Yahoo! colleagues have created a Yahoo! Cloud Serving Benchmark (YCSB) for comparing cloud serving systems.
They have used the YCSB framework to compare HBase, Cassandra, Yahoo’s PNUTS, and a simple shredded MySQL implementation.
Report at http://research.yahoo.com/files/ycsb.pdf , code at http://wiki.github.com/brianfrankcooper/YCSB/
Ver 0.1.0 available for download as open source package as of 4/23/10
While the Feb 2010 report is already somewhat out of date (for example, the HBase team has made changes), the report explains quite well the design decisions and tradeoffs (read performance vs write performance, latency vs durability, etc.) made in each system.
The YCSB framework is extensible, allowing easy definition of new workloads. As Cooper et al. note, creation of an accurate and fair benchmark, and using such to gather accurate results, is non-trivial.
Improving MapReduce - example work
Wave of research has been sparked by MapReduce / Hadoop at many universities.
Example: Bill Howe and colleagues at U of Washington. Howe: why is MapReduce/Hadoop successful? It is easy to use, flexible, fault-tolerant --> democratization of parallel computing. But the Hadoop platform (and other NoSQL scalable systems) have limitations.
Research at U of WA has led to “HaLoop” - recursive MapReduce (handles iteration control, adds caching) and “SkewReduce” - guides data partitioning using five fns (process, merge, finalize and two optional cost functions).
Example: Steve Plimpton and Karen Devine at Sandia National Lab have created an open source MapReduce-MPI library, implementing MapReduce on distributed memory parallel machines on top of standard MPI message passing. Feature: if a task can be formulated as a MapReduce, it can be performed in parallel without the user writing any parallel code ( http://www.sandia.gov/~sjplimp/mapreduce.html )
Hadoop Use in Machine Learning
Hadoop is not just for large-scale data processing.
“Mahout” is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop.
Current algorithm focus areas: clustering, classification, data mining (frequent itemset), evolutionary programming.
Web site: http://lucene.apache.org/mahout/
Current release: 0.3 (March 2010)
Other Hadoop MapReduce-based clustering work has been explored by M. Ngazimbi (2009 M.S. thesis, Boise State U.) and by K. Heafield at Google (“Hadoop Design and k-Means Clustering”, Jan 15 2008 talk)
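Clustering fits the MapReduce model naturally. As a sketch, one k-means iteration can be written as a map step (assign each point to its nearest centroid) and a reduce step (average the points per centroid). This is plain Python with illustrative names, not Mahout's API or any of the cited implementations.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    distances = [sum((p - c) ** 2 for p, c in zip(point, centroid))
                 for centroid in centroids]
    return distances.index(min(distances))

def kmeans_map(point, centroids):
    """Map: key each point by the index of its nearest centroid."""
    return nearest(point, centroids), point

def kmeans_reduce(points):
    """Reduce: the new centroid is the mean of the points sharing one key."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))

def kmeans_iteration(points, centroids):
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # A centroid that attracted no points is kept unchanged.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]

points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
centroids = kmeans_iteration(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

A full run repeats `kmeans_iteration` until the centroids stop moving, which on Hadoop means one MapReduce job per iteration.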
We may envisage construction using Hadoop of large knowledgebases on a cluster across the distributed file system.
Clojure is an important new functional language (Lisp-like) operating on the JVM which can easily call Java libs. Among other uses, Clojure (like Lisp) is employed for AI and knowledgebase development.
In particular, a U of Colorado School of Med lab is using Clojure for building an extremely large biological KB using Franz Inc.'s AllegroGraph, an RDF triple store, as backend (NIH funded).
But a Java library is available for tying Clojure programs to MapReduce code running on Hadoop HDFS - see http://stuartsierra.com/software/clojure-hadoop . Hence - substituting Hadoop for AllegroGraph is possible.
U.S. Dept of Energy will be funding work on construction of large biological knowledgebases in the coming fiscal year.
Hadoop in Bioinformatics - today’s talks
We have several talks in this session exploring use of Hadoop and HBase in bioinformatics.
Judy Qiu (Indiana U.) will compare bioinformatics algorithms parallelized in Hadoop with implementations in MPI and Microsoft Dryad.
Ben Langmead (U. of Maryland) will describe Crossbow and Myrna, tools for analysis of very large sequencing data sets. He is first author on both.
Brian O’Connor (Lineberger Comp Cancer Center, UNC Chapel Hill) will describe the use of HBase as a scalable backend for the SeqWare Query Engine.
Also, M. Hanna (Broad Institute) will talk about the design of the Genome Analysis Toolkit, which creates a framework that supports MapReduce programming.
Bioinformatics - work at University of Maryland (Michael C. Schatz, Ben Langmead, and colleagues)
CloudBurst - algorithm by Michael C. Schatz for mapping of next-gen sequencing data to a reference genome.
First bioinformatics tool of any significance that employed Hadoop. It put Hadoop “on the map” in bioinformatics.
Paper: Michael C. Schatz, CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25:11, 1363-1369, May 2009. Excellent starting point not just for details of CloudBurst, but also for short coherent descriptions of such mapping algorithms in general and of Hadoop.
Web site: http://sourceforge.net/projects/cloudburst-bio/
Bioinformatics - work at University of Maryland (Michael C. Schatz, Ben Langmead, and colleagues) - continued
Following up on CloudBurst, U of Maryland has developed a suite of algorithms for analysis of next gen sequencing data.
Crossbow (Ben Langmead, Michael Schatz) uses Hadoop for its calculations for SNP genotyping from short reads. Current release: 0.12.0 (Dec 09). http://bowtie-bio.sf.net/crossbow , http://bowtie-bio.sourceforge.net/crossbow/index.shtml
Contrail (Michael Schatz, Dan Sommer, David Kelley, and Mihai Pop) uses Hadoop for de novo assembly from short reads (without a ref genome), scaling up de Bruijn graph construction. http://sourceforge.net/apps/mediawiki/contrail-bio/index.php?title=Contrail
Myrna (Ben Langmead, Kasper Hansen, Jeff Leek) uses Bowtie (another U of MD tool) and R/Bioconductor for calculating differential gene expression from large RNA-seq data sets. When running on a cluster, uses Hadoop. First public release: May 2010, ver 1.0.0-beta2. http://bowtie-bio.sourceforge.net/myrna/index.shtml
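For readers new to de Bruijn graphs, the construction that de novo assemblers scale up is simple to state: every k-mer in a read contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The sketch below is a single-machine toy with an assumed function name; it is not Contrail's implementation and it drops edge multiplicities.

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph: nodes are (k-1)-mers, each k-mer adds one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # prefix node -> suffix node
    # Sorted successor lists make the result deterministic.
    return {node: sorted(successors) for node, successors in graph.items()}

graph = de_bruijn_edges(["ACGTC", "CGTA"], k=3)
print(graph)
```

In a MapReduce setting one plausible split is for mappers to emit (prefix, suffix) pairs per read and for reducers to merge the successors of each node; that split is an assumption here, not a description of any particular tool.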
Bioinformatics - work at Indiana University (Judy Qiu and her colleagues)
Indiana group has performed comparisons between MPI, Dryad (Microsoft), Azure (Microsoft), and Hadoop MapReduce.
In order to measure relative performance, three bioinformatics apps were implemented in each platform: an EST seq assembly program, a stats package to identify HLA-associated viral evolution, and a pairwise Alu gene alignment algorithm. Performance eval based on runs on 768 core Windows server and an Azure cloud.
Paper: Xiaohong Qiu, Jaliya Ekanayake, et al. Cloud technologies for bioinformatics applications. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (Portland, Oregon, November 16, 2009). MTAGS '09
In addition, J. Ekanayake has developed Twister, a lightweight open source MapReduce runtime to extend the applicability of MapReduce to more classes of applications. http://www.iterativemapreduce.org/
Bioinformatics - BLAST and GSEA in Hadoop
M. Gaggero and colleagues in the Distributed Computing Group at the Center for Advanced Studies, Research and Development in Sardinia, have reported on implementing BLAST and Gene Set Enrichment Analysis (GSEA) in Hadoop.
BLAST was implemented using a Python wrapper for the NCBI C++ Toolkit and Hadoop Streaming to build an executable mapper for BLAST.
GSEA was implemented using rewritten functions in Python and used with Hadoop Streaming for the MapReduce version.
Now working on development of Biodoop, a suite of parallel bioinformatics applications based upon Hadoop, consisting of three qualitatively different algorithms: BLAST, GSEA and GRAMMAR. The latter was originally implemented as part of the GenABEL R package.
Results deemed “very promising”, MapReduce a “versatile framework” that could have “a wide range of bioinformatics application”. Paper: “Parallelizing bioinformatics applications with MapReduce” by M. Gaggero, S. Leo, et al. (2008, CRS4, Edificio 1, Ricerche, Pula, Italy).
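Hadoop Streaming, used for both implementations above, runs any executable that reads lines on standard input and writes tab-separated key/value lines on standard output, with the reducer receiving its input sorted by key. A minimal sketch of that line protocol follows. The task (a histogram of sequence lengths, one sequence per input line) is an assumed example, not the BLAST or GSEA code from the paper.

```python
from itertools import groupby

def mapper(lines):
    """Emit 'length<TAB>1' for each non-empty input line (one sequence per line)."""
    for line in lines:
        sequence = line.strip()
        if sequence:
            yield f"{len(sequence)}\t1"

def reducer(sorted_lines):
    """Sum the counts of consecutive lines that share a key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"

# Local stand-in for the sort that Hadoop performs between the two stages.
mapped = sorted(mapper(["ACGT\n", "GGCA\n", "\n", "ACGTTT\n"]))
histogram = list(reducer(mapped))
print(histogram)
```

Under Hadoop Streaming each function would live in its own script, be fed `sys.stdin`, and have its output printed line by line.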
Bioinformatics - CloudBLAST (2008)
Andrea Matsunaga and colleagues at the University of Florida have created a parallelized version of the NCBI BLAST2 algorithm (BLAST 2.2.18) using Hadoop.
Parallelization approach segmented the input sequences and ran multiple instances of the unmodified NCBI BLAST2 on each segment, using Hadoop’s streaming extension.
Results across multiple input sets were compared against the publicly available version of mpiBLAST, a leading parallel version of BLAST. CloudBLAST exhibited better performance while also having advantages in simpler development and sustainability.
Matsunaga et al. concluded that for apps that can fit into the MapReduce paradigm, use of Hadoop brings significant advantages in terms of management of failures, data, and jobs. Also, for such an app “the CloudBLAST case study suggests that few (if any) performance gains would result from using a different approach that requires reprogramming”.
Paper: Andréa Matsunaga, Maurício Tsugawa and José Fortes, CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications, Fourth IEEE International Conference on eScience, 2008
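The segmentation idea behind CloudBLAST (split the query sequences, run an unmodified BLAST on each segment) depends on cutting the input only at record boundaries. The sketch below shows one way to do that for FASTA input. The round-robin assignment and the function names are assumptions for illustration, not the scheme used in the paper.

```python
def read_fasta(lines):
    """Yield (header, sequence) records from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def segment(records, n_segments):
    """Deal whole records round-robin into n_segments lists; no record is split."""
    segments = [[] for _ in range(n_segments)]
    for i, record in enumerate(records):
        segments[i % n_segments].append(record)
    return segments

fasta = [">q1", "ACGT", "ACGT", ">q2", "GGCC", ">q3", "TTAA"]
segments = segment(read_fasta(fasta), 2)
print(segments)
```

Each segment could then be handed to a separate BLAST process, and because the query records are independent, the per-segment outputs can simply be concatenated.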
Bioinformatics work at PNNL Using Hadoop/HBase
Project for a distributed parallel processing systems biology platform based on the Hadoop/MapReduce/HBase framework.
For our national user facility at Pacific Northwest National Lab (PNNL), we are to develop a scientific data management system that can scale into the petabyte range, that will accurately and reliably store data acquired from our various instruments, and that will store the output of analysis software and relevant metadata, all in one central distributed file system.
As a small pilot project for such an effort, work is now starting on a prototype data repository, i.e., a workspace for integration of high-throughput transcriptomics and proteomics data. This database will have the capacity to store very large amounts of data from mass spectrometry-based proteomics experiments as well as from next-gen high throughput sequencing platforms.
The system will be based on the robust, fault-tolerant Hadoop HDFS distributed file system running on a Linux cluster, with random access added by the HBase database layer. The pilot repository will function as a data warehouse for reports and as a platform for experimental analysis, with such reporting and analyses being implemented in MapReduce programs created using the Hadoop API to run in parallel on the cluster. (Cluster was just released to us - Hadoop being installed this week. Check back with me in four months for an update.)
Recent Hadoop / bioinformatics papers
Michael C. Schatz, CloudBurst: highly sensitive read mapping with MapReduce. Bioinformatics, 25:11, 1363-1369, May 2009
Ben Langmead, Michael C. Schatz, Jimmy Lin, Mihai Pop and Steven L. Salzberg. 2009. Searching for SNPs with cloud computing. Genome Biology 10:R134 (on Crossbow)
Qiu, X., Ekanayake, J., Beason, S., Gunarathne, T., Fox, G., Barga, R., and Gannon, D. 2009. Cloud technologies for bioinformatics applications. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (Portland, Oregon, November 16, 2009). MTAGS '09
Leo, S., Santoni, F., and Zanetti, G. 2009. Biodoop: Bioinformatics on Hadoop. In Proceedings of the 2009 International Conference on Parallel Processing Workshops (September 22 - 25, 2009)
Massimo Gaggero, Simone Leo, Simone Manca, Federico Santoni, Omar Schiaratura, Gianluigi Zanetti, Parallelizing bioinformatics applications with MapReduce, Cloud Computing and Its Applications, 2008
Andréa Matsunaga, Maurício Tsugawa and José Fortes, CloudBLAST: Combining MapReduce and Virtualization on Distributed Resources for Bioinformatics Applications, Fourth IEEE International Conference on eScience, 2008