A framework that can be installed on a commodity Linux cluster to permit large-scale distributed data analysis.
Initial version created in 2004 by Doug Cutting; it has since attracted a broad and rapidly growing user community.
Hadoop provides the robust, fault-tolerant Hadoop Distributed File System (HDFS), inspired by Google's file system, as well as a Java-based API that allows parallel processing across the nodes of the cluster using the MapReduce paradigm, allowing:
Distributed processing of large data sets
Pluggable user code that runs in a generic framework
Code written in other languages, such as Python and C, can be used through Hadoop Streaming, a utility that allows users to create and run jobs with any executables as the mapper and/or the reducer.
Hadoop comes with a JobTracker and TaskTrackers that keep track of program execution across the nodes of the cluster.
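Hadoop Streaming's protocol is line-oriented: the mapper reads raw lines on stdin and prints "key<TAB>value" lines, Hadoop sorts them by key, and the reducer reads the sorted lines on stdin. A minimal word-count sketch of that logic in Python follows; the function names are illustrative, not part of any Hadoop API, and in a real job each half would be a stand-alone script streaming sys.stdin.

```python
# Sketch of the two halves of a Hadoop Streaming word-count job.
# Streaming talks plain text: "key<TAB>value" lines; Hadoop performs
# the sort between the two stages. Function names are illustrative.
import sys

def mapper(lines):
    # Emit one "word<TAB>1" line per word.
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    # Input arrives sorted by key, so equal keys are adjacent;
    # accumulate a running total per key and emit it on key change.
    current, total = None, 0
    for line in sorted_lines:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # Simulate the pipeline locally: map, sort, reduce.
    for out in reducer(sorted(mapper(["to be or not to be"]))):
        print(out)
```

As stand-alone scripts, the pair would be submitted to the cluster with the streaming jar, along the lines of `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py`.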
Natural Choice for:
Data-intensive log processing
Web search indexing
Hadoop Framework: A Brief Background
Accelerating nightly batch business processes. Since Hadoop scales linearly, internal or external on-demand cloud farms can dynamically handle shrinking performance windows and take on larger data volumes than an RDBMS can easily deal with.
Storage of extremely high volumes of enterprise data. The Hadoop Distributed File System is a marvel in itself and can be used to hold extremely large data sets safely on commodity hardware for the long term, data that otherwise couldn't be stored or handled easily in a relational database.
HDFS creates a natural, reliable, and easy-to-use backup environment for almost any amount of data at reasonable prices considering that it's essentially a high-speed online data storage environment.
Improving the scalability of applications. Very low-cost commodity hardware can be used to power Hadoop clusters, since redundancy and fault tolerance are built into the software rather than provided by expensive proprietary enterprise hardware or software.
Use of Java for data processing instead of SQL. Hadoop is a Java platform and can be used by just about anyone fluent in the language (options for other languages are becoming available via APIs).
Producing just-in-time feeds for dashboards and business intelligence.
Handling urgent, ad hoc requests for data. While expensive enterprise data warehousing software can certainly do this, Hadoop is a strong performer when it comes to quickly asking and getting answers to urgent questions involving extremely large datasets.
Turning unstructured data into relational data. While ETL tools and bulk load applications work well with smaller datasets, few can approach the data volume and performance that Hadoop can.
Taking on tasks that require massive parallelism. Hadoop has been known to scale out to thousands of nodes in production environments.
Moving existing algorithms, code, frameworks, and components to a highly distributed computing environment.
Hadoop Framework: Leveraging Hadoop over RDBMS
[Diagram: enterprise high-volume data in-flow (XML, logs, CSV, SQL objects, JSON, binary) into the Hadoop Distributed File System (HDFS) on a commodity server cloud (scale-out); a MapReduce process runs in the Hadoop environment; results are consumed via RDBMS import, reporting, dashboards, and BI applications.]
Hadoop Processing: How It Works
Automatic & efficient parallelization / distribution
Extremely popular for analyzing large datasets in cluster environments. Its success stems from hiding the details of parallelization, fault tolerance, and load balancing behind a simple programming framework.
Widely accepted by the community:
MapReduce is often preferred over a parallel RDBMS for log processing. Examples:
Big Web 2.0 companies like Facebook, Yahoo! and Google.
Traditional enterprise customers of RDBMSs, such as JP Morgan Chase, VISA, The New York Times and China Mobile, have started investigating and embracing MapReduce.
More than 80 companies and organizations are listed as users of Hadoop for data analytic solutions, log and event processing, etc.
IBM has engaged with a number of enterprise customers to prototype novel Hadoop-based solutions over massive amounts of structured and unstructured data for their business analytics applications.
China Mobile gathers 5–8 TB of call records per day. Facebook collects almost 6 TB of new log data every day, with 1.7 PB of log data accumulated over time.
First, just formatting and loading that much data into a parallel RDBMS in a timely manner is a challenge. Second, the log records do not always follow the same schema, which makes the lack of a rigid schema in MapReduce a feature rather than a shortcoming.
Third, all the log records within a time period are typically analyzed together, making simple scans preferable to index scans.
Fourth, log processing can be very time consuming and therefore it is important to keep the analysis job going even in the event of failures.
Joining log data with all kinds of reference data in MapReduce has emerged as an important part of analytic operations for enterprise customers, as well as Web 2.0 companies.
There are separate Map and Reduce steps, each step done in parallel, each operating on sets of key-value pairs.
Program execution is divided into a Map and a Reduce stage, separated by data transfer between nodes in the cluster. So we have this workflow: Input -> Map() -> Copy()/Sort() -> Reduce() -> Output. In the first stage, a node executes a Map function on a section of the input data. Map output is a set of records in the form of key-value pairs, stored on that node.
The records for any given key – possibly spread across many nodes – are aggregated at the node running the Reducer for that key.
This involves data transfer between machines. This second Reduce stage is blocked from progressing until all the data from the Map stage has been transferred to the appropriate machine.
The Reduce stage produces another set of key-value pairs, as final output. This is a simple programming model, restricted to use of key-value pairs, but a surprising number of tasks and algorithms will fit into this framework.
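The workflow above can be sketched in a few lines of Python. This is a toy, single-process model only: the hash-based routing of keys to a fixed number of reducer partitions is illustrative (Hadoop's actual partitioner and defaults differ), but it shows the shuffle and the barrier between the two stages.

```python
# Toy model of Input -> Map() -> Copy()/Sort() -> Reduce() -> Output.
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, n_reducers=2):
    # Map stage: each input split produces intermediate key-value pairs.
    intermediate = [kv for split in inputs for kv in map_fn(split)]
    # Copy/Sort stage (shuffle): route each key to one reducer partition
    # and group all values for that key together. All map output must
    # exist before this completes -- the barrier described above.
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in intermediate:
        partitions[hash(key) % n_reducers][key].append(value)
    # Reduce stage: each partition's keys are reduced independently.
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_fn(key, part[key])
    return output

if __name__ == "__main__":
    counts = run_mapreduce(
        ["a b a", "b c"],
        lambda text: [(w, 1) for w in text.split()],
        lambda key, values: sum(values),
    )
    print(counts)  # counts: a -> 2, b -> 2, c -> 1
```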
Also, while Hadoop is currently primarily used for batch analysis of very large data sets, nothing precludes use of Hadoop for computationally intensive analyses, e.g., the Mahout machine learning project described below.
Hadoop Processing: Pig – High-Level Data Flow Language
Pig is a high-level data-flow language (Pig Latin) and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop.
Pig is designed for batch processing of data.
Pig’s infrastructure layer consists of a compiler that turns (relatively short) Pig Latin programs into sequences of MapReduce programs.
Pig is a Java client-side application that users install locally – nothing is altered on the Hadoop cluster itself. Grunt is Pig's interactive shell.
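To illustrate the data-flow style, the comments below show a short Pig Latin script counting server errors per URL in a log (file and field names are made up for illustration), followed by the same logic in plain Python so the intent of each step is clear.

```python
# The Pig Latin data flow (as comments) and its plain-Python equivalent:
#
#   raw    = LOAD 'access.log' AS (url:chararray, status:int);
#   errors = FILTER raw BY status >= 500;
#   byurl  = GROUP errors BY url;
#   counts = FOREACH byurl GENERATE group, COUNT(errors);
#   STORE counts INTO 'error_counts';
#
# Pig's compiler would turn those five statements into MapReduce jobs.
from collections import Counter

def error_counts(records):
    """records: iterable of (url, status) tuples; returns url -> count
    of responses with status >= 500 (FILTER + GROUP + COUNT)."""
    return dict(Counter(url for url, status in records if status >= 500))

if __name__ == "__main__":
    log = [("/a", 200), ("/a", 500), ("/b", 503), ("/a", 502)]
    print(error_counts(log))  # {'/a': 2, '/b': 1}
```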
Hadoop Processing: Mahout – Extensions to Hadoop Programming
Hadoop is not just for large-scale data processing.
Mahout is an Apache project for building scalable machine learning libraries, with most algorithms built on top of Hadoop.
Current algorithm focus areas of Mahout: clustering, classification, data mining (frequent itemset mining), and evolutionary programming.
Mahout clustering and classifier algorithms have direct relevance in bioinformatics, for example for clustering large gene expression data sets, and as classifiers for biomarker identification.
For the growing community of Python users in bioinformatics, Pydoop, a Python MapReduce and HDFS API for Hadoop that allows complete MapReduce applications to be written in Python, is available.
Hadoop Processing: HBase – Distributed, Fault-Tolerant and Scalable DB
HBase, modeled on Google's BigTable database, adds a distributed, fault-tolerant, scalable database built on top of HDFS, with random real-time read/write access to data.
Each HBase table is stored as a multidimensional sparse map, with rows and columns, each cell having a timestamp. A cell value at a given row and column is uniquely identified by (Table, Row, Column-Family:Column, Timestamp) -> Value. HBase has its own Java client API, and tables in HBase can be used both as an input source and as an output target for MapReduce jobs through TableInput/TableOutputFormat.
There is no single point of failure in HBase. HBase uses ZooKeeper, another Hadoop subproject, for management of partial failures.
All table accesses are by the primary key. Secondary indices are possible through additional index tables; programmers need to denormalize and replicate. There is no SQL query language in base HBase; however, a Hive/HBase integration project allows HiveQL statements to access HBase tables for both reading and inserting.
A table is made up of regions. Each region is defined by a startKey and endKey, may live on a different node, and is made up of several HDFS files and blocks, each of which is replicated by Hadoop. Columns can be added on the fly to tables, with only the parent column families being fixed in a schema. Each cell is tagged by column family and column name, so programs can always identify what type of data item a given cell contains. In addition to scaling to petabyte-size data sets, HBase makes it easy to integrate disparate data sources into a small number of tables for building a data workspace, with different columns possibly defined on the fly for different rows in the same table. (See the biological integration discussion below.)
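The cell-addressing scheme above can be modeled in a few lines of Python. This is purely an illustration of the semantics (sparse map, column-family:column addressing, timestamped versions, newest version returned by default); the real client API is HBase's Java interface, not this class.

```python
# Toy model of the HBase data model: a sparse, versioned cell map.
import time
from collections import defaultdict

class SparseTable:
    def __init__(self):
        # (row, "family:qualifier") -> list of (timestamp, value);
        # absent cells simply take no storage (sparseness).
        self.cells = defaultdict(list)

    def put(self, row, column, value, ts=None):
        # Each write adds a new timestamped version of the cell.
        self.cells[(row, column)].append((ts or time.time(), value))

    def get(self, row, column):
        # A read without an explicit version returns the newest value,
        # mirroring HBase's default read behavior.
        versions = self.cells.get((row, column))
        return max(versions)[1] if versions else None

if __name__ == "__main__":
    t = SparseTable()
    t.put("row1", "info:name", "old", ts=1)
    t.put("row1", "info:name", "new", ts=2)
    print(t.get("row1", "info:name"))  # new
```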
In addition to HBase, other scalable random-access databases are now available. HadoopDB is a hybrid of MapReduce and a standard relational database system. HadoopDB uses PostgreSQL for the database layer (one PostgreSQL instance per data chunk per node), Hadoop for the communication layer, and an extended version of Hive for the translation layer.
Hadoop Processing: HadoopDB – Architecture
A Database Connector that connects Hadoop with the single-node database systems.
A Data Loader which partitions data and manages parallel loading of data into the database systems.
A Catalog which tracks locations of different data chunks, including those replicated across multiple nodes.
The SQL-MapReduce-SQL (SMS) planner, which extends Hive to provide a SQL interface to HadoopDB.
Example System (Web Portal): terabytes of data are populated to centralized storage and processed every weekend.
Portlet Web Services can be consumed by other Portals
Integration UI to provision real time integration with external systems via web and other channels
Provisioning for admin features based on roles and level of access
[Diagram: Web Portal high-level architecture (MyASUP Portal), using Hadoop, Solr and Lucene for back-end data processing. Modules: Administration (role management, monitoring, control, report configurations); Reporting/Business Intelligence (analysis, metrics, trends); Application Integration services (portlet integration, rules, data sources, business applications, infrastructure and business services); Real-Time Integration (JMS, MQ, JDBC channels) to external apps with UI adaptation; core framework (logging, exceptions, rule engine, analytics, auditing); security.]
[Diagram: Web Portal deployment landscape. Load balancers front the web portal servers (Apache Web Server with mod_jk plug-in forwarding over HTTP to Tomcat); JBoss J2EE application servers host the portal, portal web services and jBPM; the application tier connects over JDBC to DB servers partitioned by a sharding function, with Hadoop processing behind the back end.]
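The sharding function in the deployment landscape could, for instance, be a stable hash of a record key, so that a given user always maps to the same DB server. A minimal sketch follows; the key scheme and shard count are hypothetical.

```python
# Minimal sharding-function sketch: route a key to one of n DB shards.
import hashlib

def shard_for(user_id, n_shards=4):
    # md5 gives a digest that is stable across processes and restarts,
    # unlike Python's built-in hash(), so routing stays consistent.
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

if __name__ == "__main__":
    # The same key always lands on the same shard.
    print(shard_for("user-42"), shard_for("user-42"))
```

Note that a plain modulo scheme reshuffles most keys when `n_shards` changes; consistent hashing is the usual remedy when shards are added or removed.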
Example – AOL Advertising Platform http://www.cloudera.com/blog/2011/02/an-emerging-data-management-architectural-pattern-behind-interactive-web-application/
AOL Advertising runs one of the largest online ad serving operations, serving billions of impressions each month to hundreds of millions of people. AOL faced three major data management challenges in building their ad serving platform:
How to analyze billions of user-related events, presented as a mix of structured and unstructured data, to infer demographic, psychographic and behavioral characteristics that are encapsulated into hundreds of millions of “cookie profiles”
How to make hundreds of millions of cookie profiles available to their ad targeting platform with sub-millisecond, random read latency
How to keep the user profiles fresh and current
The solution was to integrate two data management systems: one optimized for high-throughput data analysis (the "analytics" system), the other for low-latency random access (the "transactional" system). After analyzing alternatives, the final architecture paired Cloudera's Distribution for Apache Hadoop (CDH) with Membase.
Hadoop Processing: AOL Advertising – Business Case and Solution