Big Data – Hadoop Ecosystem
Nuria de las Heras
Big Data – Hadoop Ecosystem
21 May 2015
Table of Contents
A. Framework Ecosystem – Hadoop Ecosystem
1.1. Tools for working with Hadoop
1.1.1. NoSQL Databases
1.1.1.1. MongoDB
1.1.1.2. Cassandra
1.1.1.3. HBase
1.1.1.4. ZooKeeper
1.1.2. MapReduce
1.1.2.1. Hive
1.1.2.2. Impala
1.1.2.3. Pig
1.1.2.4. Cascading
1.1.2.5. Flume
1.1.2.6. Chukwa
1.1.2.7. Sqoop
1.1.2.8. Oozie
1.1.2.9. HCatalog
1.1.3. Machine learning
1.1.3.1. WEKA
1.1.3.2. Mahout
1.1.4. Visualization
1.1.4.1. Fusion Tables
1.1.4.2. Tableau
1.1.5. Search
1.1.5.1. Lucene
1.1.5.2. Solr
List of Tables
Table 1: NoSQL Databases
Table 2: MapReduce
Table 3: Machine learning
Table 4: Visualization
Table 5: Search
List of Figures
Figure 1: Hadoop Ecosystem
Figure 2: NoSQL Databases Ecosystem
Figure 3: MapReduce Ecosystem
Revision History
Date Version Description Author
0.0 Nuria de las Heras
A. Framework Ecosystem – Hadoop Ecosystem
The Hadoop platform consists of two key services: a reliable, distributed file system called
Hadoop Distributed File System (HDFS) and the high-performance parallel data processing
engine called Hadoop MapReduce.
The combination of HDFS and MapReduce provides a software framework for processing vast
amounts of data in parallel on large clusters of commodity hardware (potentially scaling to
thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing
framework designed to execute queries and other batch read operations against massive
datasets that can scale from tens of terabytes to petabytes in size.
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and
MapReduce, it soon became clear that Hadoop was not simply another application or service,
but a platform around which an entire ecosystem of capabilities could be built. Since then,
dozens of self-standing software projects have sprung into being around Hadoop, each
addressing a variety of problem spaces and meeting different needs.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not
easily parceled into neat categories. Simply keeping track of all the project names may seem
like a task of its own, but this pales in comparison to the task of tracking the functional and
architectural differences between projects. These projects are not meant to all be used
together, as parts of a single organism; some may even be seeking to solve the same
problem in different ways. What unites them is that they each seek to tap into the scalability
and power of Hadoop, particularly the HDFS component of Hadoop.
Figure 1: Hadoop Ecosystem
1.1. Tools for working with Hadoop
1.1.1. NoSQL Databases
Next-generation databases mostly address some of the following points: they are non-relational, distributed, open-source and horizontally scalable. The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, a simple API, eventually consistent / BASE (Basically Available, Soft state, Eventually consistent, as opposed to ACID), support for huge amounts of data, and more. The somewhat misleading term "NoSQL" (which the community now mostly reads as "not only SQL") should therefore be seen as an alias for something like the definition above.
Figure 2: NoSQL Databases Ecosystem
1.1.1.1. MongoDB
MongoDB is a document-oriented system whose records look similar to JSON objects, with the ability to store and query nested attributes (a short sketch follows the feature list below).
More features:
. MongoDB is written in C++.
. It is a document-oriented store. Documents encapsulate and encode data in a standard format; common document encodings include XML, YAML and JSON (JavaScript Object Notation), as well as binary forms like BSON, PDF and MS Office documents.
. Documents use BSON syntax. Data is stored and queried in BSON, which can be thought of as binary-serialized JSON-like data.
. MongoDB uses collections for storing groups of data; every document exists inside a collection.
. Documents are schema-less. Data in MongoDB has a flexible schema, and collections do not enforce a document structure.
. MongoDB supports indexes on any attribute, which provide high-performance read operations for frequently used queries.
. It supports replication and high availability: data can be mirrored across LANs and WANs, and replica sets provide redundancy and failover.
. Auto-sharding. Sharding (the process of storing data records across multiple machines) solves the problem of horizontal scaling: you add more machines to support data growth and the demand of read and write operations.
. Querying supports rich, document-based queries.
. It provides methods to perform update operations.
. Flexible aggregation and data processing: map-reduce operations can handle complex aggregation tasks.
. It stores files of any size: GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.
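To make the document model concrete, here is a minimal sketch using the MongoDB Java driver (3.x-era API) to insert a document with nested attributes and query on one of them. It assumes a local mongod on the default port; the database, collection and field names are invented for the example.

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;

public class MongoExample {
    public static void main(String[] args) {
        // Connect to a local mongod on the default port.
        try (MongoClient client = new MongoClient("localhost", 27017)) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> users = db.getCollection("users");

            // Documents are schema-less BSON; nested attributes are sub-documents.
            users.insertOne(new Document("name", "ada")
                    .append("languages", Arrays.asList("en", "fr"))
                    .append("address", new Document("city", "London")));

            // Query directly on a nested attribute using dot notation.
            Document found = users.find(Filters.eq("address.city", "London")).first();
            System.out.println(found.toJson());
        }
    }
}
```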
1.1.1.2. Cassandra
Cassandra is an open source distributed database management system designed
to handle large amounts of data across many servers, providing high availability
with no single point of failure. It offers robust support for clusters spanning
multiple datacenters, with asynchronous masterless replication allowing low
latency operations for all clients.
More features:
. Cassandra is written in Java.
. Decentralized. Every node in the cluster has the same role. There is no
single point of failure. Data is distributed across the cluster (so each node
contains different data), but there is no master as every node can service
any request.
. Scalability. Read and write throughput both increase linearly as new
machines are added, with no downtime or interruption to applications.
. Fault-tolerant. Data is automatically replicated to multiple nodes for fault-
tolerance. Replication across multiple data centers is supported. Failed
nodes can be replaced with no downtime.
. Tunable consistency. Cassandra's data model is a partitioned row store
with tunable consistency. For any given read or write operation, the client
application decides how consistent the requested data should be.
. MapReduce support. Cassandra has Hadoop integration with MapReduce support; Apache Pig and Apache Hive are also supported.
. Query language. CQL (Cassandra Query Language) is a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2) and Node.js (Helenus).
. Rows are organized into tables; the first component of a table's primary
key is the partition key; within a partition, rows are clustered by the
remaining columns of the key. Other columns may be indexed separately
from the primary key.
. Cassandra is frequently referred to as a “column-oriented” database.
Column families contain rows and columns. Each row is uniquely identified
by a row key. Each row has multiple columns, each of which has a name,
value, and a timestamp. Different rows in the same column family do not
have to share the same set of columns, and a column may be added to one
or multiple rows at any time.
. It does not support joins or subqueries, except for batch analysis via
Hadoop.
. It is not relational; it represents its data structures as sparse, multidimensional hash tables.
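A minimal sketch with the DataStax Java driver (2.x-era API) illustrates the partition-key/clustering-column model described above via CQL. The contact point, keyspace, table and data are placeholders for the example.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to any node; there is no master, so one contact point is enough.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // user_id is the partition key; day clusters rows within a partition.
            session.execute("CREATE TABLE IF NOT EXISTS demo.clicks ("
                    + "user_id text, day text, url text, "
                    + "PRIMARY KEY (user_id, day))");

            session.execute("INSERT INTO demo.clicks (user_id, day, url) "
                    + "VALUES ('u1', '2015-05-21', 'http://example.com')");

            // Queries address a partition through its partition key.
            ResultSet rs = session.execute(
                    "SELECT * FROM demo.clicks WHERE user_id = 'u1'");
            for (Row row : rs) {
                System.out.println(row.getString("day") + " " + row.getString("url"));
            }
        }
    }
}
```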
1.1.1.3. HBase
HBase is a distributed, column-oriented database built on top of HDFS, providing Bigtable-like capabilities for Hadoop. It has been designed from the ground up with a focus on scale in every direction: tall in numbers of rows (billions), wide in numbers of columns (millions).
HBase is at its best when it is accessed in a distributed fashion by many clients. Use HBase when you need random, real-time read/write access to Big Data (a short sketch follows the feature list).
More features:
. Written in Java.
. Strongly consistent reads/writes. This makes it very suitable for tasks such
as high-speed counter aggregation.
. Automatic sharding. HBase tables are distributed on the cluster via regions,
and regions are automatically split and re-distributed as your data grows.
. Automatic Region Server failover.
. In the parlance of CAP theorem, HBase is a CP (consistency and partition
tolerance) type system.
. HBase is not relational and does not support SQL.
. It depends on ZooKeeper and by default it manages a ZooKeeper instance
as the authority on cluster state.
. MapReduce. HBase supports massively parallelized processing via
MapReduce for using HBase as both source and sink.
. Java Client API. HBase offers an easy-to-use Java API for programmatic access. Tables can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs.
. Operational management. HBase provides built-in web pages for operational insight, as well as JMX metrics.
. Block Cache and Bloom Filters. HBase supports a Block Cache (an LRU cache with three levels of block priority) and Bloom Filters (a data structure that tells you, rapidly and memory-efficiently, whether an element is present in a set) for high-volume query optimization.
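The sketch below shows a random, real-time write and read through the HBase Java client (HBase 1.0-style classes). It assumes a table named "webtable" with a "contents" column family already exists, and that hbase-site.xml (pointing at the ZooKeeper quorum) is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location (ZooKeeper quorum) from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("com.example/index"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```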
1.1.1.4. ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services. All
of these kinds of services are used in some form or another by distributed
applications.
More features:
. It allows distributed processes to coordinate with each other through a shared hierarchical namespace organized similarly to a standard file system. The namespace consists of data registers (called znodes in ZooKeeper parlance), which are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.
. The performance aspects of ZooKeeper mean it can be used in large, distributed systems.
. The reliability aspects keep it from being a single point of failure.
. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
. It provides sequential consistency: updates from a client will be applied in the order that they were sent.
. Atomicity: updates either succeed or fail; there are no partial results.
. Single system image: a client will see the same view of the service regardless of the server it connects to.
. Reliability: once an update has been applied, it will persist from that time forward until a client overwrites it.
. Timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound.
. It provides a very simple programming interface.
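To illustrate that simple interface, the following sketch uses the ZooKeeper Java API to create and read back a znode. It assumes a server at localhost:2181; the path and payload are invented for the example.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the watcher here simply ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // znodes form a file-system-like namespace and hold small data payloads.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reads are served from the in-memory image, which is why they are fast.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```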
MongoDB
Advantages:
. Open source
. Easy to install
. Scalable
. High performance
. Schema-free
. Dynamic queries supported
Disadvantages:
. Higher chance of losing data when adapting content, and it is hard to retrieve it
. Tops out performance-wise at relatively small data volumes

Cassandra
Advantages:
. Open source
. Scalable
. High-level redundancy, failover and backup-restore capabilities
. It has no single point of failure
. Ability to open and deliver data in near real-time
. Supports interactive web-based applications
Disadvantages:
. Complex to administer and manage
. Although it supports indexes, they can get out of sync with the data because of the lack of transactions
. It has no joins
. It is not suitable for large blobs

HBase
Advantages:
. Open source
. Scalable
. Good solution for large-scale data processing and analysis
. Strongly consistent reads and writes
. High write performance
. Automatic failover support between Region Servers
Disadvantages:
. Management complexity
. Needs ZooKeeper
. The HDFS NameNode and the HBase Master are SPOFs (Single Points of Failure)

ZooKeeper
Advantages:
. Open source
. High performance
. Good process synchronization in the cluster
. Consistency of the configuration across the cluster
. Reliable messaging in the cluster
Disadvantages:
. Clients need to keep sending heartbeat messages in the absence of activity
. ZooKeeper cannot make partial failures go away, since they are intrinsic to distributed systems

Table 1: NoSQL Databases
1.1.2. MapReduce
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster.
Every job in MapReduce consists of three main phases: map, shuffle, and reduce.
In the map phase, the application operates on each record in the input separately, emitting intermediate key/value pairs; the application chooses the key. Many maps are started at once, so while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in less than one minute.
For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID as your key so that you could see everything done by each user on your website. In the shuffle phase, which happens after the map phase, data is grouped by the key the application has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.
In the reduce phase, the application is presented with each key together with all of the records containing that key. Again, this is done in parallel on many machines. After processing each group, the reducer can write its output.
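The canonical word-count job, sketched below with the Hadoop Java MapReduce API, makes the three phases concrete: the mapper chooses the key (here, the word), the shuffle groups records by that key, and the reducer sums each group. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // the chosen key drives the shuffle
                }
            }
        }
    }

    // Reduce phase: all records for one key arrive together; sum the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```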
More features:
. Scale-out Architecture. Adds servers to increase processing power.
. Security & Authentication. Works with HDFS and HBase security to make
sure that only approved users can operate against the data in the system.
. Resource Manager. Employs data locality and server resources to determine
optimal computing operations.
. Optimized Scheduling. Completes jobs according to prioritization.
. Flexibility. Procedures can be written in virtually any programming
language.
. Resiliency & High Availability. Multiple job and task trackers ensure that
jobs fail independently and restart automatically.
Figure 3: MapReduce Ecosystem
1.1.2.1. Hive
Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in
Hadoop compatible file systems.
Because of Hadoop’s focus on large scale processing, the latency may mean that
even simple jobs take minutes to complete, so it’s not a substitute for a real-time
transactional database.
More features:
. Scalability. Scale out with more machines added dynamically to the Hadoop
cluster.
. It provides tools to enable easy data ETL.
. Indexing to provide acceleration, with index types including compaction and bitmap indexes.
. Different storage types, such as plain text, RCFile, HBase, ORC, and others.
. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
. Operation on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, snappy, and others.
. SQL-like queries (HiveQL), which are implicitly converted into map-reduce jobs.
. Built-in user defined functions (UDFs) to manipulate dates, strings, and
other data-mining tools. Hive supports extending the UDF set to handle
use-cases not supported by built-in functions.
. Hive also provides query execution via MapReduce. It allows map/reduce
programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
. Hive is not designed for OLTP workloads.
. It does not offer real-time queries or row-level updates. It is best used for
batch jobs over large sets of append-only data (like web logs).
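As a minimal sketch, Hive can be queried over JDBC through HiveServer2. The endpoint and the web_logs table below are assumptions made for the example; the HiveQL aggregate is compiled into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needed on pre-JDBC4 setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // A HiveQL aggregate over an assumed append-only log table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS hits FROM web_logs GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```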
1.1.2.2. Impala
Impala is an open-source, interactive/real-time SQL query system that runs on data stored in HDFS.
As Impala supports SQL and provides real-time big data processing functionality,
it has the potential to be utilized as a business intelligence (BI) system.
Impala has been technically inspired by Google's Dremel paper. Dremel is a
scalable, interactive ad-hoc query system for analysis of read-only nested data. By
combining multi-level execution trees and columnar data layout, it is capable of
running aggregation queries over trillion-row tables in seconds. The system scales
to thousands of CPUs and petabytes of data.
The key difference between Impala and Hive is latency. While Hive executes queries as MapReduce jobs, Impala uses its own distributed query engine, installed on all data nodes in the cluster, to minimize response time.
More features:
. Nearly all of Hive’s SQL, including insert, join and subqueries.
. Query results faster than Hive.
. Easy to create and change schemas.
. Tables created with Hive can be queried with Impala.
. Support for a variety of data formats: Hadoop native (Apache Avro,
SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text
(uncompressed or LZO-compressed); and Parquet (Snappy or
uncompressed), the new state-of-the-art columnar storage format.
. Connectivity via JDBC, ODBC, Hue GUI, or command-line shell.
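Impala is commonly reached through the same HiveServer2 JDBC driver, pointed at an Impala daemon instead of HiveServer2 (port 21050 is the usual default in Cloudera's documentation). The host, port and table in this sketch are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // An impalad's HiveServer2-compatible port; 21050 is the common default.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/;auth=noSasl");
             Statement stmt = conn.createStatement()) {

            // The same table defined in the Hive metastore is visible to Impala.
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs");
            rs.next();
            System.out.println("rows: " + rs.getLong(1));
        }
    }
}
```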
1.1.2.3. Pig
It is a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables them to handle very large data sets.
The Apache Pig project is a procedural data processing language designed for
Hadoop. It provides an engine for executing data flows in parallel on Hadoop.
More features:
. Pig can operate on data whether it has metadata or not. It can operate on
data that is relational, nested, or unstructured. And it can easily be
extended to operate on data beyond files, including key/value stores,
databases, etc.
. Intended to be a language for parallel data processing. It is not tied to one
particular parallel framework. It has been implemented first on Hadoop,
but it is not intended to be only on Hadoop.
It can also read input from and write output to sources other than HDFS.
. Designed to be easily controlled and modified by its users.
It allows integration of user code wherever possible, so it supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
. Pig processes data quickly.
. It includes a language, Pig Latin, for expressing data flows. Pig Latin use
cases tend to fall into three separate categories: traditional extract
transform load (ETL) data pipelines, research on raw data, and iterative
processing.
Pig Latin includes operators for many of the traditional data operations
(join, sort, filter, etc.), as well as the ability for users to develop their own
functions for reading, processing, and writing data.
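A minimal ETL-style flow is sketched below using the PigServer Java API to run a few lines of Pig Latin (load, group, aggregate, store). The input path and field names are invented for the example.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Runs Pig Latin on the cluster; use ExecType.LOCAL to test on local files.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // A tiny pipeline: load tab-separated logs, group by user, count hits.
        pig.registerQuery("logs = LOAD '/data/web_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("hits = FOREACH by_user GENERATE group AS user, COUNT(logs) AS n;");
        pig.store("hits", "/data/hits_per_user");
    }
}
```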
1.1.2.4. Cascading
Most real-world Hadoop applications are built of a series of processing steps, and
Cascading lets you define that sort of complex workflow as a program. You lay out
the logical flow of the data pipeline you need, rather than building it explicitly out
of Map-Reduce steps feeding into one another. To use it, you call a Java API,
connecting objects that represent the operations you want to perform into a
graph. The system takes that definition, does some checking and planning, and executes it on a Hadoop cluster. Developers use Cascading to create a .jar file that describes the required processes.
There are a lot of built-in objects for common operations like sorting, grouping,
and joining, and you can write your own objects to run custom processing code.
More features:
. It is simple to build, easy to test, and robust in production.
. It supports optimized joins.
. Parallel running of jobs.
. Creating checkpoints.
. Developers can work in different languages (Java, Ruby, Scala, Clojure).
. Support for TSV, CSV, and custom-delimited text files.
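A rough sketch of the canonical Cascading word count follows, using Cascading 2.x-era class names (exact package locations vary by version). The source and sink paths come from the command line; the planner turns the logical pipeline into one or more MapReduce jobs.

```java
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        // Taps bind the logical pipeline to concrete HDFS paths.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // The pipeline: split lines into words, group by word, count each group.
        Pipe pipe = new Each("wordcount", new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count());

        // The planner compiles this graph and runs it on the Hadoop cluster.
        Flow flow = new HadoopFlowConnector(new Properties())
                .connect(source, sink, pipe);
        flow.complete();
    }
}
```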
1.1.2.5. Flume
Flume is a distributed system for collecting log data from many sources,
aggregating it, and writing it to HDFS. It is designed to be reliable and highly
available, while providing a simple, flexible, and intuitive programming model
based on streaming data flows.
Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper.
One very common use of Hadoop is taking web server or other logs from a large
number of machines, and periodically processing them to pull out analytics
information. The Flume project is designed to make the data gathering process
easy and scalable, by running agents on the source machines that pass the data
updates to collectors, which then aggregate them into large chunks that can be
efficiently written as HDFS files. It’s usually set up using a command-line tool that
supports common operations, like tailing a file or listening on a network socket,
and has tunable reliability guarantees that let you trade off performance and the
potential for data loss.
More features:
. Reliability (the ability to continue delivering events in the face of failures
without losing data). Flume can guarantee that all data received by an
agent node will eventually make it to the collector at the end of its flow as
long as the agent node keeps running. That is, data can be reliably
delivered to its eventual destination. Flume allows the user to specify, on a
per-flow basis, the level of reliability required. There are three supported
reliability levels: end-to-end, store on failure, best effort.
. Scalability (the ability to increase system performance linearly by adding
more resources to the system). A key performance measure in Flume is the
number or size of events entering the system and being delivered. When
load increases, it is simple to add more resources to the system in the form
of more machines to handle the increased load.
. Manageability (the ability to control data flows, monitor nodes, modify
settings, and control outputs of a large system). The Flume Master is the
point where global state such as the data flows can be managed. Via the
Flume Master, users can monitor flows and reconfigure them on the fly.
. Extensibility (the ability to add new functionality to a system). For example,
you can extend Flume by adding connectors to existing storage layers or
data platforms. This is made possible by simple interfaces, separation of
functional concerns into simple pieces, a flow specification language, and a
simple but flexible data model. Flume provides many common input and
output connectors.
1.1.2.6. Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately,
Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is
somewhat tedious and the batch nature of MapReduce makes it difficult to use
with logs that are generated incrementally across many machines. Furthermore,
HDFS still does not support appending to existing files. Chukwa is a Hadoop
subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analyzing log-based data. Its durability features include agent-side replaying of data to recover from errors.
. Collection components of Chukwa: adaptors, agents (that run on each
machine and emit data), and collectors (that receive data from the agent
and write to a stable storage).
. Chukwa includes Hadoop Infrastructure Care Center (HICC), which is a web
interface for visualizing system performance.
. Flexible and powerful toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
. Chukwa's reliability model supports two levels: end-to-end reliability, and fast-path delivery, which minimizes latency. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams.
1.1.2.7. Sqoop
It is an open-source tool that allows users to extract data from a relational
database into Hadoop for further processing. This processing can be done with
MapReduce programs or other higher-level tools such as Hive. (It’s even possible
to use Sqoop to move data from a relational database into HBase.) When the final
results of an analytic pipeline are available, Sqoop can export these results back to
the database for consumption by other clients.
More features:
. Bulk import. Sqoop can import individual tables or entire databases into
HDFS. The data is stored in the native directories and files in the HDFS file
system.
. Direct input. Sqoop can import and map SQL (relational) databases directly
into Hive and HBase.
. Data interaction. Sqoop can generate Java classes so that you can interact
with the data programmatically.
. Data export. Sqoop can export data directly from HDFS into a relational
database using a target table definition based on the specifics of the target
database.
. It integrates with Oozie.
. It is a command-line tool (a typical import/export pair is shown below).
. It comes complete with connectors to MySQL, PostgreSQL, Oracle, SQL
Server and DB2.
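As an illustration of that command-line interface, an import/export pair might look like the following sketch. The connection string, credentials, table names and HDFS paths are placeholders.

```bash
# Import a relational table into HDFS as files under the target directory
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Export the results of an analytic pipeline back to the database
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table order_stats \
  --export-dir /data/order_stats
```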
1.1.2.8. Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs.
An Oozie workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) that specifies a sequence of action executions. This graph is written in hPDL (an XML Process Definition Language).
More features:
. Oozie is a scalable, reliable and extensible system.
. Oozie can detect completion of computation/processing tasks by two
different means, callbacks and polling.
. Some workflows are invoked on demand, but most need to run at regular time intervals and/or based on data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters.
. It can run jobs sequentially (one after the other) and in parallel (multiple at a time).
. Oozie can also run plain java classes, Pig workflows, and interact with the
HDFS.
. Oozie provides major flexibility (start, stop, suspend and re-run jobs).
It allows you to restart from a failure (you can tell Oozie to restart a job
from a specific node in the graph or to skip specific failed nodes).
. Java Client API / Command Line Interface (launch, control and monitor jobs
from your Java Apps).
. Web Service API (you can control jobs from anywhere).
. Receive an email when a job is complete.
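A minimal sketch of the Oozie Java client API, submitting a workflow and polling it until it finishes. The server URL, application path and job properties are placeholders; the workflow definition itself is assumed to already sit in HDFS.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL and HDFS application path are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8032");

        // Submit and start the workflow, then poll its status.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```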
1.1.2.9. HCatalog
HCatalog is an abstraction for data storage and a metadata service.
It provides a set of interfaces that open up access to Hive's metastore for tools
inside and outside of the Hadoop grid.
More features:
. It presents users with a table abstraction. This frees them from knowing
where or how their data are stored.
. It allows data producers to change how they write data while still
supporting existing data in the old format so that data consumers do not
have to change their processes.
. It provides a shared schema and data model for Pig, Hive, and MapReduce.
. It provides interoperability across data processing tools such as Pig, Map
Reduce, and Hive.
. A REST interface to allow language independent access to Hive's metadata.
. HCatalog includes Hive's command-line interface, so that administrators can create and drop tables, specify table parameters, etc.
. It also provides an API for storage format developers to tell HCatalog how
to read and write data stored in different formats.
. It supports RCFile (Record Columnar File), CSV (Comma Separated Values),
JSON (JavaScript Object Notation), and SequenceFile formats.
. The data model of HCatalog is similar to HBase’s data model.
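As a sketch of the shared table abstraction, a Pig script can load a Hive-defined table by name through HCatLoader, with no path or schema in the script. The table and column below are hypothetical, and the loader's package has moved between releases (older versions use org.apache.hcatalog.pig.HCatLoader).

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class HCatalogFromPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // No path and no schema in the script: HCatalog resolves 'web_logs'
        // (a hypothetical Hive-defined table) to its location and column types.
        pig.registerQuery("logs = LOAD 'web_logs' "
                + "USING org.apache.hive.hcatalog.pig.HCatLoader();");
        pig.registerQuery("recent = FILTER logs BY dt == '2015-05-21';");
        pig.store("recent", "/tmp/recent_logs");
    }
}
```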
Hive
Advantages:
. Open source
. Easy data summarization
. Ad-hoc queries
. Provides a Hadoop query language (HiveQL), similar to SQL
. Metadata store, which makes lookups easy
Disadvantages:
. It is not for OLTP processing
. Data is required to be loaded from a file

Impala
Advantages:
. Open source
. SQL operations on top of Hadoop
. Useful with HBase, Hive, Pig
. Query results faster than Hive
Disadvantages:
. Not all of Hive's SQL is supported
. You cannot create or modify a table

Pig
Advantages:
. Open source
. Very quick for processing large stable datasets such as meteorological trends or web-server logs
. Perfect for data processing that involves a number of steps (a pipeline of processing)
. Ideal for solving problems that can be carved up, analyzed in pieces in parallel and then put back together (text mining, sentiment trends, recommendation, pattern recognition)
. Pig makes it simple to build scripts to analyze data, experimenting with approaches to identify the best one
. It resides on the user machine; it is not necessary to install anything in the Hadoop cluster
Disadvantages:
. It is not ideal for real-time or near real-time processing

Cascading
Advantages:
. Open source
. There are a lot of pre-built components that can be composed together
. Very custom operations can be written as straight Java functions
. It allows you to write analytics jobs quickly and easily in a familiar language
Disadvantages:
. It is not the best fit for some fine-grained, performance-critical problems

Flume
Advantages:
. Open source
. Scalable
. Solution for data collection of all forms
. Possible sources for Flume include Avro files and system logs
. It has a query processing engine
. It allows streaming data to be managed and captured into Hadoop
Disadvantages:
. It does not do real-time analytics

Chukwa
Advantages:
. Open source
. Scalable
. Comprehensive toolset for log analysis
. It has a rich metadata model
. It can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog
. It provides a framework for processing the collected data
Disadvantages:
. Chukwa works with an agent-collector setup that works predominantly with a single collector unless configured for a multi-collector setup
. It does not have any support for a gzip feature to zip the data files before or after storing data in HDFS

Sqoop
Advantages:
. Open source
. It is extensible; a number of third-party companies ship database-specific connectors
. Connectors register metadata (Sqoop 2)
. Admins set policy for connection use (Sqoop 2)
. It is compatible with almost any JDBC-enabled database
. Integration with Hive and HBase
Disadvantages:
. Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or partition

Oozie
Advantages:
. It supports: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, and sub-workflow actions
. It supports variables and functions
. Interval job scheduling is time- and input-data-dependent
Disadvantages:
. All job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library, which makes it more difficult to use

HCatalog
Advantages:
. It provides a shared schema and data model for Pig, Hive, and MapReduce
Disadvantages:
. None found

Table 2: MapReduce
1.1.3. Machine learning
Machine learning is a branch of artificial intelligence that concerns the construction
and study of systems that can learn from data.
For example, a machine learning system could be trained on email messages to
learn to distinguish between spam and non-spam messages. After learning, it can
then be used to classify new email messages into spam and non-spam folders.
The core of machine learning deals with representation and generalization. Generalization is the ability of a learning machine to perform accurately on new, unseen examples or tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model of this space that enables it to produce sufficiently accurate predictions in previously unseen cases.
Machine learning focuses on prediction, based on known properties learned from
the training data.
1.1.3.1. WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It
provides a plug-in architecture for researchers to add their own techniques, with a
command-line and window interface that makes it easy to apply them to your own
data. You can use it to do everything from basic clustering to advanced
classification, together with a lot of tools for visualizing your results.
It is heavily used as a teaching tool, but it also comes in extremely handy for
prototyping and experimenting outside of the classroom.
It has a strong set of preprocessing tools that make it easy to load your data in,
and then you have a large library of algorithms at your fingertips, so you can
quickly try out ideas until you find an approach that works for your problem.
The command-line interface allows you to apply exactly the same code in an
automated way for production.
More features:
. WEKA includes data preprocessing tools.
. Classification/regression algorithms.
. Clustering algorithms.
. Attribute/subset evaluators and search algorithms for feature selection.
. Algorithms for finding association rules.
. Graphical user interfaces: the Explorer (exploratory data analysis), the Experimenter (experimental environment), and the Knowledge Flow (a process-model-inspired interface).
. WEKA is platform-independent.
. It is easily usable by people who are not data mining specialists.
. Provides flexible facilities for scripting experiments.
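A minimal sketch of the WEKA Java API: load an ARFF dataset, train a J48 decision tree, and evaluate it with 10-fold cross-validation. The file name is a placeholder; the class attribute is assumed to be the last one.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (path is a placeholder) and mark the class attribute.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree and cross-validate it (10 folds, fixed seed).
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```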
1.1.3.2. Mahout
Mahout is an open source machine learning library from Apache. Its focus is primarily on recommender engines (collaborative filtering), clustering, and classification.
Mahout aims to be the machine learning tool of choice when the collection of data
to be processed is very large, perhaps far too large for a single machine.
It’s a framework of tools intended to be used and adapted by developers. In
practical terms, the framework makes it easy to use analysis techniques to
implement features such as Amazon’s “People who bought this also bought”
recommendation engine on your own site.
More features:
. Mahout is scalable.
. It supports algorithms for recommendation: for example, it takes users' behavior and from that tries to find items users might like (see the sketch below).
. Algorithms for clustering: it takes, for example, text documents and groups them into clusters of topically related documents.
. Algorithms for classification: it learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the correct category.
. Algorithms for frequent itemset mining: it takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.
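A minimal sketch of a user-based recommender with Mahout's Taste API. The ratings file (lines of userID,itemID,preference), the neighborhood size and the user ID are placeholders for the example.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (placeholder) holds lines of: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // "People like you": similar users, a neighborhood, a recommender.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommended items for user 1.
        List<RecommendedItem> recs = recommender.recommend(1, 3);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```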
WEKA
Advantages:
. Free availability under the GNU General Public License
. Portability, since it is fully implemented in Java
. Ease of use due to its graphical user interfaces
. It provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query
Disadvantages:
. It is not capable of multi-relational data mining
. Sequence modeling is not covered
. Experiments involving very large data quantities (millions of instances) can take a long time to process

Mahout
Advantages:
. Open source
. Scalable
. It can process very large data quantities
. It has functionality for many of today's common machine learning tasks
Disadvantages:
. Mahout is merely a library of algorithms, not a finished product

Table 3: Machine learning
1.1.4. Visualization
Visualization tools let you gain deeper insights from data stored in Hadoop. Including these tools in an analysis reveals patterns and associations that would otherwise be missed.
1.1.4.1. Fusion Tables
Google has created an integrated online system that lets you store large amounts
of data in spreadsheet-like tables and gives you tools to process and visualize the
information. It’s particularly good at turning geographic data into compelling
maps, with the ability to upload your own custom KML (XML notation for
expressing geographic annotation and visualization within Internet-based, two-
dimensional maps and three-dimensional Earth browsers) outlines for areas like
political constituencies. There is also a full set of traditional graphing tools, as well
as a wide variety of options to perform calculations on your data.
Fusion Tables is a powerful system, but it’s definitely aimed at fairly technical
users; the sheer variety of controls can be intimidating at first. If you’re looking for
a flexible tool to make sense of large amounts of data, it’s worth making the
effort.
More features:
. Fusion Tables is an experimental data visualization web application to gather, visualize, and share large data tables.
. Fusion Tables lets you visualize big table data online: filter and summarize across hundreds of thousands of rows, then try a chart, map, network graph, or custom layout, and embed or share it. Merge two or three tables to generate a single visualization.
. Combine your data with other data on the web.
. Make a map in minutes.
. Host data online.
1.1.4.2. Tableau
Originally a traditional desktop application for drawing graphs and visualizations,
Tableau has been adding a lot of support for online publishing and content
creation. Its embedded graphs have become very popular with news organizations
on the Web, illustrating a lot of stories.
The support for geographic data isn’t as extensive as Fusion’s, but Tableau is
capable of creating some map styles that Google’s product can’t produce.
More features:
. With Tableau Public, interactive visuals can be created and published without the help of programmers.
. It offers hundreds of visualization types, such as maps, bar and line charts,
lists, and heat maps.
. Tableau Public is automatically touch-optimized for Android and iPad
tablets. It supports all browsers without plug-ins.
Fusion Tables
Advantages:
. Good at turning geographic data into compelling maps, with the ability to upload your own custom KML
. They offer spatial query processing and very thorough Google Maps integration
Disadvantages:
. Access must be authenticated
. There is no organization to datasets

Tableau
Advantages:
. It brings in data fast thanks to its in-memory analytical engine
. It has native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks and the MapR Hadoop distribution for Hadoop reporting and analysis
. It has powerful visualization capabilities that let you create maps, charts and dashboards easily
Disadvantages:
. It is not open source

Table 4: Visualization
1.1.5. Search
Search is well suited to leverage a lot of different types of information, especially
unstructured information.
One of the first things any organization is going to want to do once it accumulates a
mass of Big Data is search it.
1.1.5.1. Lucene
Lucene is a Java-based search library. It has an architecture that employs best
practice relevancy ranking and querying, as well as state of the art text
compression and a partitioned index strategy to optimize both query performance
and indexing flexibility.
More features:
. Speed — sub-second query performance for most queries.
. Complete query capabilities: keyword, Boolean and +/- queries, proximity
operators, wildcards, fielded searching, term/field/document weights,
find-similar, spell-checking, multi-lingual search and others.
. Full results processing, including sorting by relevancy, date or any field,
dynamic summaries and hit highlighting.
. Portability: runs on any platform supporting Java, and indexes are portable
across platforms – you can build an index on Linux and copy it to a
Microsoft Windows machine and search it there.
. Scalability — there are production applications in the hundreds of millions
and billions of documents/records.
. Low overhead indexes and rapid incremental indexing.
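A minimal sketch of indexing and searching with the Lucene Java API follows. Constructor signatures vary across Lucene versions (4.x requires a Version argument); this follows the 5.x style and uses an in-memory index, with invented field names and content.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a single analyzed, stored text field.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("body", "Hadoop ecosystem search with Lucene",
                Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Parse a query against the "body" field and fetch the top hits.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(
                new QueryParser("body", analyzer).parse("lucene"), 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("body"));
        }
    }
}
```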
1.1.5.2. Solr
Solr is a standalone enterprise search server with a REST-like API. You put
documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You
query it via HTTP GET and receive XML, JSON, CSV or binary results.
Solr is highly scalable, providing distributed search and index replication.
More features:
. Advanced full-text search capabilities.
. Optimized for high volume web traffic.
. Standards based open interfaces - XML, JSON and HTTP.
. Comprehensive HTML administration interfaces.
. Server statistics exposed over JMX for monitoring.
. Linearly scalable, auto index replication, auto failover and recovery.
. Near real-time indexing.
. Flexible and adaptable with XML configuration.
. Extensible plugin architecture.
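A minimal sketch using the SolrJ client to index a document and query it over HTTP. The core URL and field names are placeholders, and the fields are assumed to exist in the core's schema (in SolrJ 4.x the client class is HttpSolrServer rather than HttpSolrClient).

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // URL of an assumed core named "articles" on a local Solr server.
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/articles");

        // Index one document, then commit to make it searchable.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Big Data - Hadoop Ecosystem");
        solr.add(doc);
        solr.commit();

        // Query over HTTP GET and iterate the results.
        QueryResponse rsp = solr.query(new SolrQuery("title:hadoop"));
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("title"));
        }
        solr.close();
    }
}
```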
Lucene
Advantages:
. It is the core search library (a library for indexing and searching text)
Disadvantages:
. ACID (or near-ACID) behavior is not guaranteed; a crash while writing to a Lucene index might render it useless

Solr
Advantages:
. It is the logical starting point for developers building search applications
. It is good at reads
Disadvantages:
. Updates replace whole documents rather than individual fields (so when you have a million documents that say "German" and should say "French", you have to reindex a million documents)
. It takes too long to update and commit

Table 5: Search
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 

Big Data - Hadoop Ecosystem

Revision History

Date        Version     Description     Author
            0.0                         Nuria de las Heras
A. Framework Ecosystem – Hadoop Ecosystem

The Hadoop platform consists of two key services: a reliable, distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Hadoop MapReduce. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can scale from tens of terabytes to petabytes in size.

When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs. The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parceled into neat categories.

Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.
Figure 1: Hadoop Ecosystem

1.1. Tools for working with Hadoop

1.1.1. No SQL Databases
Next-generation databases mostly address some of the following points: being non-relational, distributed, open source, and horizontally scalable. The original intention was to support modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often further characteristics apply, such as: schema-free design, easy replication support, a simple API, eventual consistency / BASE (Basically Available, Soft state, Eventually consistent, as opposed to ACID), the ability to handle huge amounts of data, and more. So the misleading term "NoSQL" (which the community now mostly reads as "not only SQL") should be seen as an alias for something like the definition above.
Figure 2: No SQL Databases Ecosystem

1.1.1.1. MongoDB
It is a document-oriented system, with records that look similar to JSON objects and the ability to store and query on nested attributes.
More features:
. MongoDB is written in C++.
. It is document-oriented storage. Documents are assumed to encapsulate and encode data in some standard format or encoding. Encodings in use include XML, YAML and JSON (JavaScript Object Notation), as well as binary forms like BSON, PDF and MS Office documents.
. Documents use BSON syntax. Data is stored and queried in BSON; think binary-serialized JSON-like data.
. MongoDB uses collections for storing groups of data. Documents exist inside a collection.
. Documents are schema-less. Data in MongoDB has a flexible schema; collections do not enforce a document structure.
. MongoDB supports indexes on any attribute, which provides high-performance read operations for frequently used queries.
. It supports replication and high availability, which means mirroring across LANs and WANs. Replica sets provide redundancy and high availability.
. Auto-sharding. Sharding (the process of storing data records across multiple machines) solves the problem of horizontal scaling: you add more machines to support data growth and the demand of read and write operations.
. Querying supports rich, document-based queries.
. It provides methods to perform update operations.
. Flexible aggregation and data processing. Map-reduce operations can handle complex aggregation tasks.
. It stores files of any size. GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.
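As a minimal sketch of the document model and nested-attribute queries described above, using the MongoDB Java driver (3.x style); the database, collection, and field names are illustrative, not part of the original text:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.eq;

    public class MongoExample {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            MongoCollection<Document> users = client.getDatabase("test").getCollection("users");

            // Documents are schema-less and may nest sub-documents.
            users.insertOne(new Document("name", "Ada")
                    .append("address", new Document("city", "London")));

            // Query on a nested attribute using dot notation.
            Document found = users.find(eq("address.city", "London")).first();
            System.out.println(found.toJson());
            client.close();
        }
    }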
1.1.1.2. Cassandra
Cassandra is an open-source distributed database management system designed to handle large amounts of data across many servers, providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
More features:
. Cassandra is written in Java.
. Decentralized. Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
. Scalability. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
. Fault-tolerant. Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
. Tunable consistency. Cassandra's data model is a partitioned row store with tunable consistency. For any given read or write operation, the client application decides how consistent the requested data should be.
. MapReduce support. Cassandra has Hadoop integration, with MapReduce support. There is also support for Apache Pig and Apache Hive.
. Query language. CQL (Cassandra Query Language) is a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2) and Node.js (Helenus).
. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.
. Cassandra is frequently referred to as a "column-oriented" database. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and timestamp. Different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.
. It does not support joins or subqueries, except for batch analysis via Hadoop.
. It is not relational; it represents its data structures as sparse, multidimensional hash tables.
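A short CQL sketch of the primary-key layout described above (the table and column names are illustrative): the first key component is the partition key, and the second clusters rows within the partition.

    CREATE TABLE clicks (
        user_id  text,       -- partition key: decides which node stores the row
        ts       timestamp,  -- clustering column: orders rows inside the partition
        url      text,
        PRIMARY KEY (user_id, ts)
    );

    INSERT INTO clicks (user_id, ts, url)
    VALUES ('ada', '2015-05-21 10:00:00', '/index.html');

    SELECT url FROM clicks WHERE user_id = 'ada';  -- touches a single partition

Because the partition key routes each row to a node, single-partition reads like the SELECT above stay fast regardless of cluster size; there is no join to express, in line with the limitations listed above.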
1.1.1.3. HBase
It is a distributed, column-oriented database built on top of HDFS, providing Bigtable-like capabilities for Hadoop. It has been designed from the ground up with a focus on scale in every direction: tall in number of rows (billions) and wide in number of columns (millions). HBase is at its best when it is accessed in a distributed fashion by many clients. HBase is recommended when you need random, real-time read/write access to Big Data.
More features:
. Written in Java.
. Strongly consistent reads/writes. This makes it very suitable for tasks such as high-speed counter aggregation.
. Automatic sharding. HBase tables are distributed on the cluster via regions, and regions are automatically split and redistributed as your data grows.
. Automatic RegionServer failover.
. In the parlance of the CAP theorem, HBase is a CP (consistency and partition tolerance) type system.
. HBase is not relational and does not support SQL.
. It depends on ZooKeeper, and by default it manages a ZooKeeper instance as the authority on cluster state.
. MapReduce. HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink.
. Java client API. HBase provides an easy-to-use Java API for programmatic access. Tables can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs.
. Operational management. HBase provides built-in web pages for operational insight, as well as JMX metrics.
. Block Cache (an LRU cache that contains three levels of block priority) and Bloom Filters (a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set). HBase supports a Block Cache and Bloom Filters for high-volume query optimization.
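A minimal sketch of the Java client API mentioned above, in the HBase 1.x style; the table name, column family, and values are illustrative, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("clicks"))) {

                // Write one cell: row key, column family, qualifier, value.
                Put put = new Put(Bytes.toBytes("user-ada"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"),
                        Bytes.toBytes("/index.html"));
                table.put(put);

                // Random, real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("user-ada")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"))));
            }
        }
    }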
1.1.1.4. Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
More features:
. It allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organized similarly to a standard file system. The namespace consists of data registers (called znodes, in ZooKeeper parlance), and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.
. The performance aspects of ZooKeeper mean it can be used in large, distributed systems.
. The reliability aspects keep it from being a single point of failure.
. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
. It provides sequential consistency. Updates from a client will be applied in the order that they were sent.
. Atomicity. Updates either succeed or fail; there are no partial results.
. Single system image. A client will see the same view of the service regardless of the server it connects to.
. Reliability. Once an update has been applied, it will persist from that time forward until a client overwrites it.
. Timeliness. The client's view of the system is guaranteed to be up-to-date within a certain time bound.
. It provides a very simple programming interface.
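A minimal sketch of that programming interface using the ZooKeeper Java client (the connection string and znode path are illustrative): create a znode holding a piece of configuration, then read it back.

    import org.apache.zookeeper.*;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect with a 3-second session timeout; the watcher ignores events here.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // Create a persistent znode holding configuration data.
            zk.create("/app-config", "maxWorkers=8".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Single system image: any server in the ensemble returns the same view.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }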
MongoDB
Advantages:
. Open source
. Easy to install
. Scalable
. High performance
. Schema-free
. Dynamic queries supported
Disadvantages:
. Higher chance of losing data when adapting content, and it is hard to retrieve it
. Tops out performance-wise at relatively small data volumes

Cassandra
Advantages:
. Open source
. Scalable
. High-level redundancy, failover and backup-restore capabilities
. It has no single point of failure
. Ability to open and deliver data in near real-time
. Supports interactive web-based applications
Disadvantages:
. Complex administering and managing
. Although it supports indexes, they can get out of sync with the data because of the lack of transactions
. It has no joins
. It is not suitable for large blobs

HBase
Advantages:
. Open source
. Scalable
. Good solution for large-scale data processing and analysis
. Strongly consistent reads and writes
. High write performance
. Automatic failover support between Region Servers
Disadvantages:
. Management complexity
. Needs ZooKeeper
. The HDFS NameNode and HBase Master are SPOFs (Single Points of Failure)

Zookeeper
Advantages:
. Open source
. High performance
. Good process synchronization in the cluster
. Consistency of the configuration in the cluster
. Reliable messaging in the cluster
Disadvantages:
. Clients need to keep sending heartbeat messages in the absence of activity
. ZooKeeper cannot make partial failures go away, since they are intrinsic to distributed systems

Table 1: No SQL Databases
1.1.2. Map - Reduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Every job in MapReduce consists of three main phases: map, shuffle, and reduce.

In the map phase, the application has the opportunity to operate on each record in the input separately. Many maps are started at once, so that while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in less than one minute. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID as your key so that you could see everything done by each user on your website.

In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.

In the reduce phase, the application is presented with each key, together with all of the records containing that key. Again, this is done in parallel on many machines. After processing each group, the reducer can write its output. A minimal sketch of the web-log example is given after the feature list below.
More features:
. Scale-out architecture. Add servers to increase processing power.
. Security and authentication. Works with HDFS and HBase security to make sure that only approved users can operate against the data in the system.
. Resource manager. Employs data locality and server resources to determine optimal computing operations.
. Optimized scheduling. Completes jobs according to prioritization.
. Flexibility. Procedures can be written in virtually any programming language.
. Resiliency and high availability. Multiple job and task trackers ensure that jobs fail independently and restart automatically.
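The sketch below expresses the web-log example in the Hadoop Java API: the mapper emits the user ID as the key, the shuffle groups all records per user, and the reducer counts them. The log format (tab-separated, user ID first) and class names are illustrative assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    public class UserActivityCount {
        // Map phase: runs on each input record independently.
        public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assume the user ID is the first tab-separated field.
                String userId = line.toString().split("\t")[0];
                ctx.write(new Text(userId), ONE);  // the user ID becomes the shuffle key
            }
        }

        // Reduce phase: receives one user ID together with all of its records.
        public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text userId, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();
                ctx.write(userId, new IntWritable(total));
            }
        }
    }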
Figure 3: Map – Reduce Ecosystem

1.1.2.1. Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Because of Hadoop's focus on large-scale processing, the latency may mean that even simple jobs take minutes to complete, so it is not a substitute for a real-time transactional database.
More features:
. Scalability. Scale out with more machines added dynamically to the Hadoop cluster.
. It provides tools to enable easy data ETL.
. Indexing to provide acceleration; index types include compaction and bitmap indexes.
. Different storage types such as plain text, RCFile, HBase, ORC, and others.
. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
. Operating on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, snappy, and others.
. SQL-like queries (HiveQL), which are implicitly converted into map-reduce jobs.
. Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
. Hive also provides query execution via MapReduce. It allows map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
. Hive is not designed for OLTP workloads.
. It does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).
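A small HiveQL sketch of the append-only web-log use case named above (the table, column names, and path are illustrative); the SELECT is implicitly compiled into MapReduce jobs:

    CREATE TABLE weblogs (user_id STRING, url STRING, ts STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/logs/2015-05-21' INTO TABLE weblogs;

    -- Batch summarization over the whole dataset.
    SELECT user_id, COUNT(*) AS hits
    FROM weblogs
    GROUP BY user_id;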
1.1.2.2. Impala
Impala is an open-source, interactive/real-time SQL query system that runs on HDFS. As Impala supports SQL and provides real-time big data processing functionality, it has the potential to be used as a business intelligence (BI) system.

Impala was technically inspired by Google's Dremel paper. Dremel is a scalable, interactive ad-hoc query system for the analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds, and the system scales to thousands of CPUs and petabytes of data.

The difference between Impala and Hive is whether the system is real-time or not. While Hive uses MapReduce for data access, Impala uses its own distributed query engine to minimize response time. This distributed query engine is installed on all data nodes in the cluster.
More features:
. Nearly all of Hive's SQL, including insert, join and subqueries.
. Query results faster than Hive.
. Easy to create and change schemas.
. Tables created with Hive can be queried with Impala.
. Support for a variety of data formats: Hadoop-native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format.
. Connectivity via JDBC, ODBC, the Hue GUI, or a command-line shell.
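Because Hive-created tables can be queried with Impala, the weblogs table from the Hive sketch above can be re-queried interactively. A sketch using Impala's command-line shell (the impalad host name is illustrative); the result comes back from the distributed query engine rather than from a launched MapReduce job:

    impala-shell -i datanode1 \
        -q "SELECT user_id, COUNT(*) AS hits FROM weblogs GROUP BY user_id"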
1.1.2.3. Pig
It is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

The Apache Pig project provides a procedural data processing language designed for Hadoop, with an engine for executing data flows in parallel on Hadoop.
More features:
. Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
. Intended to be a language for parallel data processing. It is not tied to one particular parallel framework: it was implemented first on Hadoop, but it is not intended to run only on Hadoop. It can also read input from, and write output to, sources other than HDFS.
. Designed to be easily controlled and modified by its users. It allows integration of user code wherever possible, so it supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
. Pig processes data quickly.
. It includes a language, Pig Latin, for expressing data flows. Pig Latin use cases tend to fall into three separate categories: traditional extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
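A minimal Pig Latin sketch of an ETL-style flow (the paths and field names are illustrative); each statement describes one step of the data flow, which Pig compiles into parallel jobs:

    logs    = LOAD '/logs/2015-05-21' USING PigStorage('\t')
              AS (user_id:chararray, url:chararray);
    valid   = FILTER logs BY user_id IS NOT NULL;
    grouped = GROUP valid BY user_id;
    hits    = FOREACH grouped GENERATE group AS user_id, COUNT(valid) AS hits;
    STORE hits INTO '/output/user_hits';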
1.1.2.4. Cascading
Most real-world Hadoop applications are built of a series of processing steps, and Cascading lets you define that sort of complex workflow as a program. You lay out the logical flow of the data pipeline you need, rather than building it explicitly out of Map-Reduce steps feeding into one another. To use it, you call a Java API, connecting objects that represent the operations you want to perform into a graph. The system takes that definition, does some checking and planning, and executes it on the Hadoop cluster. Developers use Cascading to create a .jar file that describes the required processes. There are a lot of built-in objects for common operations like sorting, grouping, and joining, and you can write your own objects to run custom processing code.
More features:
. It is simple to build, easy to test, and robust in production.
. It supports optimized joins.
. Parallel running jobs.
. Creating checkpoints.
. Developers can work in different languages (Java, Ruby, Scala, Clojure).
. Support for TSV, CSV, and custom-delimited text files.
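A minimal sketch in the Cascading 2.x Java API style (paths and field names are illustrative): taps represent sources and sinks, pipes represent the logical flow, and the planner turns the connected graph into MapReduce jobs.

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CascadingExample {
        public static void main(String[] args) {
            // Source and sink taps over tab-delimited files in HDFS.
            Tap in  = new Hfs(new TextDelimited(new Fields("user_id", "url"), "\t"), "/logs/in");
            Tap out = new Hfs(new TextDelimited(new Fields("user_id", "url"), "\t"), "/logs/grouped");

            // Logical pipeline: group records by user_id.
            Pipe pipe = new Pipe("group-by-user");
            pipe = new GroupBy(pipe, new Fields("user_id"));

            FlowDef flow = FlowDef.flowDef()
                    .addSource(pipe, in)
                    .addTailSink(pipe, out);

            // The planner checks the graph and runs it as MapReduce jobs.
            new HadoopFlowConnector().connect(flow).complete();
        }
    }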
1.1.2.5. Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper.

One very common use of Hadoop is taking web server or other logs from a large number of machines and periodically processing them to pull out analytics information. The Flume project is designed to make the data-gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files. It is usually set up using a command-line tool that supports common operations, like tailing a file or listening on a network socket, and has tunable reliability guarantees that let you trade off performance against the potential for data loss.
More features:
. Reliability (the ability to continue delivering events in the face of failures without losing data). Flume can guarantee that all data received by an agent node will eventually make it to the collector at the end of its flow, as long as the agent node keeps running. That is, data can be reliably delivered to its eventual destination. Flume allows the user to specify, on a per-flow basis, the level of reliability required. There are three supported reliability levels: end-to-end, store on failure, and best effort.
. Scalability (the ability to increase system performance linearly by adding more resources to the system). A key performance measure in Flume is the number or size of events entering the system and being delivered. When load increases, it is simple to add more resources to the system, in the form of more machines, to handle the increased load.
. Manageability (the ability to control data flows, monitor nodes, modify settings, and control outputs of a large system). The Flume Master is the point where global state, such as the data flows, can be managed. Via the Flume Master, users can monitor flows and reconfigure them on the fly.
. Extensibility (the ability to add new functionality to a system). For example, you can extend Flume by adding connectors to existing storage layers or data platforms. This is made possible by simple interfaces, separation of functional concerns into simple pieces, a flow specification language, and a simple but flexible data model. Flume provides many common input and output connectors.
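Note that the Master/collector model described above belongs to the original Flume (OG); the newer Flume NG (1.x) releases instead declare a flow per agent in a properties file. A hedged sketch in the NG style (the agent name, log path, and namenode host are illustrative), tailing a web server log into HDFS:

    agent.sources = tail-src
    agent.channels = mem-ch
    agent.sinks = hdfs-sink

    # Tail a web server log as the event source.
    agent.sources.tail-src.type = exec
    agent.sources.tail-src.command = tail -F /var/log/httpd/access_log
    agent.sources.tail-src.channels = mem-ch

    # Buffer events in memory (fast, but lost if the agent crashes).
    agent.channels.mem-ch.type = memory

    # Aggregate events into files on HDFS.
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs
    agent.sinks.hdfs-sink.channel = mem-ch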
1.1.2.6. Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose: writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files.

Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analyzing log-based data. Some of its durability features include agent-side replaying of data to recover from errors.
. Collection components of Chukwa: adaptors, agents (which run on each machine and emit data), and collectors (which receive data from the agents and write it to stable storage).
. Chukwa includes the Hadoop Infrastructure Care Center (HICC), a web interface for visualizing system performance.
. Flexible and powerful toolkit for displaying, monitoring and analyzing results, to make the best use of the collected data.
. Chukwa's reliability model supports two levels: end-to-end reliability, and fast-path delivery, which minimizes latency. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams.

1.1.2.7. Sqoop
It is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. This processing can be done with MapReduce programs or with other higher-level tools such as Hive. (It is even possible to use Sqoop to move data from a relational database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the database for consumption by other clients.
More features:
. Bulk import. Sqoop can import individual tables or entire databases into HDFS. The data is stored in native directories and files in the HDFS file system.
. Direct input. Sqoop can import and map SQL (relational) databases directly into Hive and HBase.
. Data interaction. Sqoop can generate Java classes so that you can interact with the data programmatically.
. Data export. Sqoop can export data directly from HDFS into a relational database, using a target table definition based on the specifics of the target database.
. It integrates with Oozie.
. It is a command-line interpreter.
. It comes complete with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2.
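A sketch of the import/export round trip described above on the Sqoop command line (the connection string, tables, and paths are illustrative):

    # Bulk-import a relational table into HDFS.
    sqoop import --connect jdbc:mysql://dbhost/sales \
                 --table orders --target-dir /data/orders

    # Export the results of an analytic pipeline back to the database.
    sqoop export --connect jdbc:mysql://dbhost/sales \
                 --table order_stats --export-dir /results/order_stats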
1.1.2.8. Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) specifying a sequence of actions to execute. This graph is specified in hPDL (an XML Process Definition Language).
More features:
. Oozie is a scalable, reliable and extensible system.
. Oozie can detect completion of computation/processing tasks by two different means: callbacks and polling.
. Some workflows are invoked on demand, but the majority of the time it is necessary to run them based on regular time intervals and/or data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters.
. It can run jobs sequentially (one after the other) and in parallel (multiple at a time).
. Oozie can also run plain Java classes and Pig workflows, and interact with HDFS.
. Oozie provides major flexibility (start, stop, suspend and re-run jobs). It allows you to restart from a failure (you can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes).
. Java client API / command-line interface (launch, control and monitor jobs from your Java apps).
. Web service API (you can control jobs from anywhere).
. Receive an email when a job is complete.
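A minimal hPDL sketch of a one-action workflow (the workflow name, paths, and properties are illustrative, and the mapper/reducer class configuration is elided); each action node declares its ok/error transitions, which together form the control-dependency DAG:

    <workflow-app name="log-pipeline" xmlns="uri:oozie:workflow:0.4">
        <start to="count-users"/>
        <action name="count-users">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- mapper/reducer classes would be configured as further properties -->
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/logs/2015-05-21</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/output/user_hits</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Map/Reduce action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>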
1.1.2.9. HCatalog
HCatalog is an abstraction for data storage and a metadata service. It provides a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid.
More features:
. It presents users with a table abstraction. This frees them from knowing where or how their data is stored.
. It allows data producers to change how they write data while still supporting existing data in the old format, so that data consumers do not have to change their processes.
. It provides a shared schema and data model for Pig, Hive, and MapReduce.
. It provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
. A REST interface allows language-independent access to Hive's metadata.
. HCatalog includes Hive's command-line interface, so that administrators can create and drop tables, specify table parameters, etc.
. It also provides an API for storage-format developers to tell HCatalog how to read and write data stored in different formats.
. It supports RCFile (Record Columnar File), CSV (Comma Separated Values), JSON (JavaScript Object Notation), and SequenceFile formats.
. The data model of HCatalog is similar to HBase's data model.
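As a sketch of the shared table abstraction, a Pig script can read a Hive-defined table by name through HCatalog rather than knowing its path or storage format (the table and field names are illustrative; the loader class shown is the one shipped in recent Hive/HCatalog releases):

    logs   = LOAD 'weblogs' USING org.apache.hive.hcatalog.pig.HCatLoader();
    recent = FILTER logs BY user_id IS NOT NULL;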
Hive
Advantages:
. Open source
. Easy data summarization
. Ad-hoc queries
. Provides the Hive Query Language, similar to SQL
. Metadata store, which makes lookup easy
Disadvantages:
. It is not for OLAP processing
. Data is required to be loaded from a file

Impala
Advantages:
. Open source
. SQL operation on top of Hadoop
. Useful with HBase, Hive, Pig
. Query results faster than Hive
Disadvantages:
. Not all Hive SQL is supported
. You cannot create or modify a table

Pig
Advantages:
. Open source
. Very quick for processing large stable datasets, such as meteorological trends or web-server logs
. Perfect for data processing that involves a number of steps (a pipeline of processing)
. Ideal for solving problems that can be carved up, analyzed in pieces in parallel, and then put back together (text mining, sentiment trends, recommendation, pattern recognition)
. Pig makes it simple to build scripts to analyze data, experimenting with approaches to identify the best one
. It resides on the user's machine; it is not necessary to install anything in the Hadoop cluster
Disadvantages:
. It is not ideal for real-time or near real-time processing

Cascading
Advantages:
. Open source
. There are a lot of pre-built components that can be composed together
. Very custom operations can be written as a straight Java function
. It allows you to write analytics jobs quickly and easily in a familiar language
Disadvantages:
. It is not the best fit for some fine-grained, performance-critical problems

Flume
Advantages:
. Open source
. Scalable
. A solution for data collection of all forms
. Possible sources for Flume include Avro files and system logs
. It has a query processing engine
. It allows streaming data to be managed and captured into Hadoop
Disadvantages:
. It does not do real-time analytics

Chukwa
Advantages:
. Open source
. Scalable
. Comprehensive toolset for log analysis
. It has a rich metadata model
. It can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog
. It provides a framework for processing the collected data
Disadvantages:
. Chukwa works with an agent-collector set-up that works predominantly with a single collector, unless configured for a multi-collector set-up
. It does not have any support for a gzip feature to zip the data files before or after storing data in HDFS

Sqoop
Advantages:
. Open source
. It is extensible; a number of third-party companies ship database-specific connectors
. Connectors register metadata (Sqoop 2)
. Admins set policy for connection use (Sqoop 2)
. It is compatible with almost any JDBC-enabled database
. Integration with Hive and HBase
Disadvantages:
. Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or partition

Oozie
Advantages:
. It supports: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, sub-workflow
. It supports variables and functions
. Interval job scheduling is time- and input-data-dependent
Disadvantages:
. All job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library, which makes it more difficult to use

HCatalog
Advantages:
. It provides a shared schema and data model for Pig, Hive, and MapReduce
Disadvantages:
. None found

Table 2: Map – Reduce

1.1.3. Machine learning
Machine learning is a branch of artificial intelligence concerned with the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.

The core of machine learning deals with representation and generalization. Generalization is the ability of a learning machine to perform accurately on new, unseen examples or tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model of this space that enables it to produce sufficiently accurate predictions in previously unseen cases. Machine learning focuses on prediction, based on known properties learned from the training data.
1.1.3.1. WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and windowed interface that makes it easy to apply them to your own data. You can use it to do everything from basic clustering to advanced classification, together with a lot of tools for visualizing your results.

It is heavily used as a teaching tool, but it also comes in extremely handy for prototyping and experimenting outside of the classroom. It has a strong set of preprocessing tools that make it easy to load your data in, and then you have a large library of algorithms at your fingertips, so you can quickly try out ideas until you find an approach that works for your problem. The command-line interface allows you to apply exactly the same code in an automated way for production.
More features:
. WEKA includes data preprocessing tools.
. Classification/regression algorithms.
. Clustering algorithms.
. Attribute/subset evaluators and search algorithms for feature selection.
. Algorithms for finding association rules.
. Graphical user interfaces: the Explorer (exploratory data analysis), the Experimenter (experimental environment), and the Knowledge Flow (a new process-model-inspired interface).
. WEKA is platform-independent.
. It is easily usable by people who are not data mining specialists.
. Provides flexible facilities for scripting experiments.
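A minimal sketch of WEKA's Java API for the spam/non-spam style of classification described earlier (the ARFF file name is illustrative): load a dataset, train a decision tree, and classify an instance.

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaExample {
        public static void main(String[] args) throws Exception {
            // Load training data; the last attribute is the class (e.g. spam / not-spam).
            Instances data = DataSource.read("mail.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Train a J48 decision tree (WEKA's C4.5 implementation).
            J48 tree = new J48();
            tree.buildClassifier(data);

            // Classify the first instance as a smoke test.
            Instance first = data.instance(0);
            double label = tree.classifyInstance(first);
            System.out.println(data.classAttribute().value((int) label));
        }
    }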
1.1.3.2. Mahout
Mahout is an open-source machine learning library from Apache. It primarily covers recommender engines (collaborative filtering), clustering, and classification. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. It is a framework of tools intended to be used and adapted by developers. In practical terms, the framework makes it easy to use analysis techniques to implement features such as Amazon's "People who bought this also bought" recommendation engine on your own site.
More features:
. Mahout is scalable.
. It supports algorithms for recommendation. For example, it takes users' behavior and from that tries to find items users might like.
. Algorithms for clustering. It takes e.g. text documents and groups them into groups of topically related documents.
. Algorithms for classification. It learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the correct category.
. Algorithms for frequent itemset mining. It takes a set of item groups (terms in a query session, shopping-cart content) and identifies which individual items usually appear together.
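A minimal sketch of a user-based recommender using Mahout's Taste API (the file name and IDs are illustrative); this is the single-machine flavor of the "people who bought this also bought" feature described above:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class MahoutExample {
        public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,preference triples.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            NearestNUserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 items for user 1, based on the behavior of similar users.
            List<RecommendedItem> items = recommender.recommend(1L, 3);
            for (RecommendedItem item : items)
                System.out.println(item.getItemID() + " " + item.getValue());
        }
    }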
WEKA
Advantages:
. Free availability under the GNU General Public License
. Portability, since it is fully implemented in Java
. Ease of use due to its graphical user interfaces
. It provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query
Disadvantages:
. It is not capable of multi-relational data mining
. Sequence modeling is not covered
. Experiments involving very large quantities of data (millions of instances) can take a long time to process

Mahout
Advantages:
. Open source
. Scalable
. It can process very large data quantities
. It has functionality for many of today's common machine learning tasks
Disadvantages:
. Mahout is merely a library of algorithms; it is not a product

Table 3: Machine learning

1.1.4. Visualization
Visualization tools let you gain deeper insights from the data stored in Hadoop. Including these tools in an analysis reveals patterns and associations that would otherwise be missed.

1.1.4.1. Fusion Tables
Google has created an integrated online system that lets you store large amounts of data in spreadsheet-like tables and gives you tools to process and visualize the information. It is particularly good at turning geographic data into compelling maps, with the ability to upload your own custom KML (an XML notation for expressing geographic annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers) outlines for areas like political constituencies. There is also a full set of traditional graphing tools, as well as a wide variety of options to perform calculations on your data.

Fusion Tables is a powerful system, but it is definitely aimed at fairly technical users; the sheer variety of controls can be intimidating at first. If you are looking for a flexible tool to make sense of large amounts of data, it is worth making the effort.
More features:
. Fusion Tables is an experimental data visualization web application to gather, visualize, and share larger data tables.
. Fusion Tables lets you visualize bigger table data online: filter and summarize across hundreds of thousands of rows, then try a chart, map, network graph, or custom layout and embed or share it. Merge two or three tables to generate a single visualization.
. Combine with other data on the web.
. Make a map in minutes.
. Host data online.

1.1.4.2. Tableau
Originally a traditional desktop application for drawing graphs and visualizations, Tableau has been adding a lot of support for online publishing and content creation. Its embedded graphs have become very popular with news organizations on the Web, illustrating a lot of stories. The support for geographic data is not as extensive as Fusion's, but Tableau is capable of creating some map styles that Google's product cannot produce.
More features:
. With Tableau Public, interactive visuals can be created and published without the help of programmers.
. It offers hundreds of visualization types, such as maps, bar and line charts, lists, and heat maps.
. Tableau Public is automatically touch-optimized for Android and iPad tablets. It supports all browsers without plug-ins.

Fusion Tables
Advantages:
. Good at turning geographic data into compelling maps, with the ability to upload your own custom KML
. It offers spatial query processing and very thorough Google Maps integration
Disadvantages:
. Access must be authenticated
. There is no organization of datasets

Tableau
Advantages:
. It is fast at bringing in data, thanks to its in-memory analytical engine
. It has native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks and the MapR Hadoop distribution for Hadoop reporting and analysis
. It has powerful visualization capabilities that let you create maps, charts and dashboards easily
Disadvantages:
. It is not open source

Table 4: Visualization
1.1.5. Search
Search is well suited to leveraging many different types of information, especially unstructured information. One of the first things any organization will want to do once it accumulates a mass of Big Data is search it.

1.1.5.1. Lucene
Lucene is a Java-based search library. It has an architecture that employs best-practice relevancy ranking and querying, as well as state-of-the-art text compression and a partitioned index strategy, to optimize both query performance and indexing flexibility.
More features:
. Speed: sub-second query performance for most queries.
. Complete query capabilities: keyword, Boolean and +/- queries, proximity operators, wildcards, fielded searching, term/field/document weights, find-similar, spell-checking, multilingual search and others.
. Full results processing, including sorting by relevancy, date or any field, dynamic summaries and hit highlighting.
. Portability: it runs on any platform supporting Java, and indexes are portable across platforms. You can build an index on Linux, copy it to a Microsoft Windows machine and search it there.
. Scalability: there are production applications in the hundreds of millions and billions of documents/records.
. Low-overhead indexes and rapid incremental indexing.
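A minimal indexing-and-search sketch against the Lucene 5.x API (the field name, sample text, and index path are illustrative): index one document with a full-text field, then run a parsed keyword query ranked by relevancy.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneExample {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("index"));

            // Index one document with a full-text "body" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", "Hadoop ecosystem overview", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search with a parsed keyword query; hits come back ranked by relevancy.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(
                        new QueryParser("body", analyzer).parse("hadoop"), 10).scoreDocs;
                for (ScoreDoc hit : hits)
                    System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }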
1.1.5.2. Solr
Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results. Solr is highly scalable, providing distributed search and index replication.
More features:
. Advanced full-text search capabilities.
. Optimized for high-volume web traffic.
. Standards-based open interfaces: XML, JSON and HTTP.
. Comprehensive HTML administration interfaces.
. Server statistics exposed over JMX for monitoring.
. Linearly scalable, with automatic index replication, automatic failover and recovery.
. Near real-time indexing.
. Flexible and adaptable, with XML configuration.
. Extensible plugin architecture.

Lucene
Advantages:
. It is the core search library (a library for indexing and searching text)
Disadvantages:
. ACID (or near-ACID) behavior is not guaranteed; a crash while writing to a Lucene index might render it useless

Solr
Advantages:
. It is the logical starting point for developers building search applications
. It is good at reads
Disadvantages:
. Documents are updated as a whole rather than per field (so when you have a million documents that say "German" and should say "French", you have to reindex a million documents)
. It takes too long to update and commit

Table 5: Search