Big Data – Hadoop Ecosystem
Nuria de las Heras
Big Data – Hadoop Ecosystem
21 May 2015
Table of Contents
A. Framework Ecosystem – Hadoop Ecosystem
1.1. Tools for working with Hadoop
1.1.1. NoSQL Databases
1.1.1.1. MongoDB
1.1.1.2. Cassandra
1.1.1.3. HBase
1.1.1.4. ZooKeeper
1.1.2. MapReduce
1.1.2.1. Hive
1.1.2.2. Impala
1.1.2.3. Pig
1.1.2.4. Cascading
1.1.2.5. Flume
1.1.2.6. Chukwa
1.1.2.7. Sqoop
1.1.2.8. Oozie
1.1.2.9. HCatalog
1.1.3. Machine learning
1.1.3.1. WEKA
1.1.3.2. Mahout
1.1.4. Visualization
1.1.4.1. Fusion Tables
1.1.4.2. Tableau
1.1.5. Search
1.1.5.1. Lucene
1.1.5.2. Solr
List of Tables
Table 1: NoSQL Databases
Table 2: MapReduce
Table 3: Machine learning
Table 4: Visualization
Table 5: Search
List of Figures
Figure 1: Hadoop Ecosystem
Figure 2: NoSQL Databases Ecosystem
Figure 3: MapReduce Ecosystem
Revision History
Date Version Description Author
0.0 Nuria de las Heras
A. Framework Ecosystem – Hadoop Ecosystem
The Hadoop platform consists of two key services: a reliable, distributed file system called
Hadoop Distributed File System (HDFS) and the high-performance parallel data processing
engine called Hadoop MapReduce.
The combination of HDFS and MapReduce provides a software framework for processing vast
amounts of data in parallel on large clusters of commodity hardware (potentially scaling to
thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing
framework designed to execute queries and other batch read operations against massive
datasets that can scale from tens of terabytes to petabytes in size.
When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and
MapReduce, it soon became clear that Hadoop was not simply another application or service,
but a platform around which an entire ecosystem of capabilities could be built. Since then,
dozens of self-standing software projects have sprung into being around Hadoop, each
addressing a variety of problem spaces and meeting different needs.
The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not
easily parceled into neat categories. Simply keeping track of all the project names may seem
like a task of its own, but this pales in comparison to the task of tracking the functional and
architectural differences between projects. These projects are not meant to all be used
together, as parts of a single organism; some may even be seeking to solve the same
problem in different ways. What unites them is that they each seek to tap into the scalability
and power of Hadoop, particularly the HDFS component of Hadoop.
Figure 1: Hadoop Ecosystem
1.1. Tools for working with Hadoop
1.1.1. NoSQL Databases
Next-generation databases mostly address some of the following points: they are non-relational, distributed, open-source and horizontally scalable. The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often more characteristics apply, such as: schema-free, easy replication support, a simple API, eventually consistent / BASE (Basically Available, Soft state, Eventually consistent, as opposed to ACID), support for huge amounts of data, and more. The somewhat misleading term "NoSQL" (which the community now mostly reads as "not only SQL") should therefore be seen as an alias for something like the definition above.
Figure 2: NoSQL Databases Ecosystem
1.1.1.1. MongoDB
MongoDB is a document-oriented system whose records look similar to JSON objects, with the ability to store and query nested attributes (a short sketch follows the feature list below).
More features:
. MongoDB is written in C++.
. It is a document-oriented store. Documents encapsulate and encode data in a standard format; common document encodings include XML, YAML and JSON (JavaScript Object Notation), as well as binary forms like BSON, PDF and MS Office documents.
. Documents use BSON syntax. Data is stored and queried in BSON, which can be thought of as binary-serialized JSON-like data.
. MongoDB uses collections for storing groups of data; every document exists inside a collection.
. Documents are schema-less. Data in MongoDB has a flexible schema, and collections do not enforce a document structure.
. MongoDB supports indexes on any attribute, which provide high-performance read operations for frequently used queries.
. It supports replication and high availability: data can be mirrored across LANs and WANs, and replica sets provide redundancy and failover.
. Auto-sharding. Sharding (the process of storing data records across multiple machines) solves the problem of horizontal scaling: you add more machines to support data growth and the demand of read and write operations.
. Querying supports rich, document-based queries.
. It provides methods to perform update operations.
. Flexible aggregation and data processing: map-reduce operations can handle complex aggregation tasks.
. It stores files of any size: GridFS is a specification for storing and retrieving files that exceed the BSON document size limit of 16 MB.
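To make the document model concrete, here is a minimal sketch using the MongoDB Java driver (3.x-era API) to insert a document with nested attributes and query on one of them. It assumes a local mongod on the default port; the database, collection and field names are invented for the example.

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Arrays;

public class MongoExample {
    public static void main(String[] args) {
        // Connect to a local mongod on the default port.
        try (MongoClient client = new MongoClient("localhost", 27017)) {
            MongoDatabase db = client.getDatabase("demo");
            MongoCollection<Document> users = db.getCollection("users");

            // Documents are schema-less BSON; nested attributes are sub-documents.
            users.insertOne(new Document("name", "ada")
                    .append("languages", Arrays.asList("en", "fr"))
                    .append("address", new Document("city", "London")));

            // Query directly on a nested attribute using dot notation.
            Document found = users.find(Filters.eq("address.city", "London")).first();
            System.out.println(found.toJson());
        }
    }
}
```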
1.1.1.2. Cassandra
Cassandra is an open source distributed database management system designed
to handle large amounts of data across many servers, providing high availability
with no single point of failure. It offers robust support for clusters spanning
multiple datacenters, with asynchronous masterless replication allowing low
latency operations for all clients.
More features:
. Cassandra is written in Java.
. Decentralized. Every node in the cluster has the same role. There is no
single point of failure. Data is distributed across the cluster (so each node
contains different data), but there is no master as every node can service
any request.
. Scalability. Read and write throughput both increase linearly as new
machines are added, with no downtime or interruption to applications.
. Fault-tolerant. Data is automatically replicated to multiple nodes for fault-
tolerance. Replication across multiple data centers is supported. Failed
nodes can be replaced with no downtime.
. Tunable consistency. Cassandra's data model is a partitioned row store
with tunable consistency. For any given read or write operation, the client
application decides how consistent the requested data should be.
. MapReduce support. Cassandra has Hadoop integration with MapReduce support; Apache Pig and Apache Hive are also supported.
. Query language. CQL (Cassandra Query Language) is a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2) and Node.js (Helenus).
. Rows are organized into tables; the first component of a table's primary
key is the partition key; within a partition, rows are clustered by the
remaining columns of the key. Other columns may be indexed separately
from the primary key.
. Cassandra is frequently referred to as a “column-oriented” database.
Column families contain rows and columns. Each row is uniquely identified
by a row key. Each row has multiple columns, each of which has a name,
value, and a timestamp. Different rows in the same column family do not
have to share the same set of columns, and a column may be added to one
or multiple rows at any time.
. It does not support joins or subqueries, except for batch analysis via
Hadoop.
. It is not relational; it represents its data structures as sparse, multidimensional hash tables.
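A minimal sketch with the DataStax Java driver (2.x-era API) illustrates the partition-key/clustering-column model described above via CQL. The contact point, keyspace, table and data are placeholders for the example.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraExample {
    public static void main(String[] args) {
        // Connect to any node; there is no master, so one contact point is enough.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
            // user_id is the partition key; day clusters rows within a partition.
            session.execute("CREATE TABLE IF NOT EXISTS demo.clicks ("
                    + "user_id text, day text, url text, "
                    + "PRIMARY KEY (user_id, day))");

            session.execute("INSERT INTO demo.clicks (user_id, day, url) "
                    + "VALUES ('u1', '2015-05-21', 'http://example.com')");

            // Queries address a partition through its partition key.
            ResultSet rs = session.execute(
                    "SELECT * FROM demo.clicks WHERE user_id = 'u1'");
            for (Row row : rs) {
                System.out.println(row.getString("day") + " " + row.getString("url"));
            }
        }
    }
}
```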
1.1.1.3. HBase
HBase is a distributed, column-oriented database built on top of HDFS, providing Bigtable-like capabilities for Hadoop. It has been designed from the ground up with a focus on scale in every direction: tall in numbers of rows (billions), wide in numbers of columns (millions).
HBase is at its best when it is accessed in a distributed fashion by many clients. Use HBase when you need random, real-time read/write access to Big Data (a short sketch follows the feature list).
More features:
. Written in Java.
. Strongly consistent reads/writes. This makes it very suitable for tasks such
as high-speed counter aggregation.
. Automatic sharding. HBase tables are distributed on the cluster via regions,
and regions are automatically split and re-distributed as your data grows.
. Automatic Region Server failover.
. In the parlance of CAP theorem, HBase is a CP (consistency and partition
tolerance) type system.
. HBase is not relational and does not support SQL.
. It depends on ZooKeeper and by default it manages a ZooKeeper instance
as the authority on cluster state.
. MapReduce. HBase supports massively parallelized processing via
MapReduce for using HBase as both source and sink.
. Java Client API. HBase offers an easy-to-use Java API for programmatic access. Tables can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs.
. Operational management. HBase provides built-in web pages for operational insight, as well as JMX metrics.
. Block Cache and Bloom Filters. HBase supports a Block Cache (an LRU cache with three levels of block priority) and Bloom Filters (a data structure that tells you, rapidly and memory-efficiently, whether an element is present in a set) for high-volume query optimization.
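The sketch below shows a random, real-time write and read through the HBase Java client (HBase 1.0-style classes). It assumes a table named "webtable" with a "contents" column family already exists, and that hbase-site.xml (pointing at the ZooKeeper quorum) is on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        // Reads the cluster location (ZooKeeper quorum) from hbase-site.xml.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("com.example/index"));
            put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"),
                    Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Random, real-time read of the same row.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index")));
            byte[] html = result.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```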
1.1.1.4. ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services. All
of these kinds of services are used in some form or another by distributed
applications.
More features:
. It allows distributed processes to coordinate with each other through a shared hierarchical namespace organized similarly to a standard file system. The namespace consists of data registers (called znodes in ZooKeeper parlance), which are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.
. The performance aspects of ZooKeeper mean it can be used in large, distributed systems.
. The reliability aspects keep it from being a single point of failure.
. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
. It provides sequential consistency: updates from a client will be applied in the order that they were sent.
. Atomicity: updates either succeed or fail; there are no partial results.
. Single system image: a client will see the same view of the service regardless of the server it connects to.
. Reliability: once an update has been applied, it will persist from that time forward until a client overwrites it.
. Timeliness: the client's view of the system is guaranteed to be up-to-date within a certain time bound.
. It provides a very simple programming interface.
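To illustrate that simple interface, the following sketch uses the ZooKeeper Java API to create and read back a znode. It assumes a server at localhost:2181; the path and payload are invented for the example.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble; the watcher here simply ignores events.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

        // znodes form a file-system-like namespace and hold small data payloads.
        if (zk.exists("/app-config", false) == null) {
            zk.create("/app-config", "v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Reads are served from the in-memory image, which is why they are fast.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}
```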
MongoDB
Advantages:
. Open source
. Easy to install
. Scalable
. High performance
. Schema-free
. Dynamic queries supported
Disadvantages:
. Higher chance of losing data when adapting content, and it is hard to retrieve it
. Tops out performance-wise at relatively small data volumes

Cassandra
Advantages:
. Open source
. Scalable
. High-level redundancy, failover and backup-restore capabilities
. It has no single point of failure
. Ability to open and deliver data in near real-time
. Supports interactive web-based applications
Disadvantages:
. Complex to administer and manage
. Although it supports indexes, they can get out of sync with the data because of the lack of transactions
. It has no joins
. It is not suitable for large blobs

HBase
Advantages:
. Open source
. Scalable
. Good solution for large-scale data processing and analysis
. Strongly consistent reads and writes
. High write performance
. Automatic failover support between Region Servers
Disadvantages:
. Management complexity
. Needs ZooKeeper
. The HDFS NameNode and the HBase Master are SPOFs (Single Points of Failure)

ZooKeeper
Advantages:
. Open source
. High performance
. Good process synchronization in the cluster
. Consistency of the configuration across the cluster
. Reliable messaging in the cluster
Disadvantages:
. Clients need to keep sending heartbeat messages in the absence of activity
. ZooKeeper cannot make partial failures go away, since they are intrinsic to distributed systems

Table 1: NoSQL Databases
1.1.2. MapReduce
MapReduce is a programming model for processing large data sets with a parallel,
distributed algorithm on a cluster.
Every job in MapReduce consists of three main phases: map, shuffle, and reduce.
In the map phase, the application operates on each record in the input separately, emitting intermediate key/value pairs; the application chooses the key. Many maps are started at once, so while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in less than one minute.
For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID as your key so that you could see everything done by each user on your website. In the shuffle phase, which happens after the map phase, data is grouped by the key the application has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.
In the reduce phase, the application is presented with each key together with all of the records containing that key. Again, this is done in parallel on many machines. After processing each group, the reducer can write its output.
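The canonical word-count job, sketched below with the Hadoop Java MapReduce API, makes the three phases concrete: the mapper chooses the key (here, the word), the shuffle groups records by that key, and the reducer sums each group. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every word in every input line.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE); // the chosen key drives the shuffle
                }
            }
        }
    }

    // Reduce phase: all records for one key arrive together; sum the counts.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```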
More features:
. Scale-out Architecture. Adds servers to increase processing power.
. Security & Authentication. Works with HDFS and HBase security to make
sure that only approved users can operate against the data in the system.
. Resource Manager. Employs data locality and server resources to determine
optimal computing operations.
. Optimized Scheduling. Completes jobs according to prioritization.
. Flexibility. Procedures can be written in virtually any programming
language.
. Resiliency & High Availability. Multiple job and task trackers ensure that
jobs fail independently and restart automatically.
Figure 3: MapReduce Ecosystem
1.1.2.1. Hive
Hive is a data warehouse system for Hadoop that facilitates easy data
summarization, ad-hoc queries, and the analysis of large datasets stored in
Hadoop compatible file systems.
Because of Hadoop’s focus on large scale processing, the latency may mean that
even simple jobs take minutes to complete, so it’s not a substitute for a real-time
transactional database.
More features:
. Scalability. Scale out with more machines added dynamically to the Hadoop
cluster.
. It provides tools to enable easy data ETL.
. Indexing to provide acceleration, with index types including compaction and bitmap indexes.
. Different storage types, such as plain text, RCFile, HBase, ORC, and others.
. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
. Operation on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, snappy, and others.
. SQL-like queries (HiveQL), which are implicitly converted into map-reduce jobs.
. Built-in user defined functions (UDFs) to manipulate dates, strings, and
other data-mining tools. Hive supports extending the UDF set to handle
use-cases not supported by built-in functions.
. Hive also provides query execution via MapReduce. It allows map/reduce
programmers to plug in their custom mappers and reducers when it is
inconvenient or inefficient to express this logic in HiveQL.
. Hive is not designed for OLTP workloads.
. It does not offer real-time queries or row-level updates. It is best used for
batch jobs over large sets of append-only data (like web logs).
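As a minimal sketch, Hive can be queried over JDBC through HiveServer2. The endpoint and the web_logs table below are assumptions made for the example; the HiveQL aggregate is compiled into MapReduce jobs behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Register the HiveServer2 JDBC driver (needed on pre-JDBC4 setups).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // A HiveQL aggregate over an assumed append-only log table.
            ResultSet rs = stmt.executeQuery(
                    "SELECT user_id, COUNT(*) AS hits FROM web_logs GROUP BY user_id");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```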
1.1.2.2. Impala
Impala is an open-source, interactive/real-time SQL query system that runs on data stored in HDFS.
As Impala supports SQL and provides real-time big data processing functionality,
it has the potential to be utilized as a business intelligence (BI) system.
Impala has been technically inspired by Google's Dremel paper. Dremel is a
scalable, interactive ad-hoc query system for analysis of read-only nested data. By
combining multi-level execution trees and columnar data layout, it is capable of
running aggregation queries over trillion-row tables in seconds. The system scales
to thousands of CPUs and petabytes of data.
The key difference between Impala and Hive is latency. While Hive executes queries as MapReduce jobs, Impala uses its own distributed query engine, installed on all data nodes in the cluster, to minimize response time.
More features:
. Nearly all of Hive’s SQL, including insert, join and subqueries.
. Query results faster than Hive.
. Easy to create and change schemas.
. Tables created with Hive can be queried with Impala.
. Support for a variety of data formats: Hadoop native (Apache Avro,
SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text
(uncompressed or LZO-compressed); and Parquet (Snappy or
uncompressed), the new state-of-the-art columnar storage format.
. Connectivity via JDBC, ODBC, Hue GUI, or command-line shell.
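Impala is commonly reached through the same HiveServer2 JDBC driver, pointed at an Impala daemon instead of HiveServer2 (port 21050 is the usual default in Cloudera's documentation). The host, port and table in this sketch are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // An impalad's HiveServer2-compatible port; 21050 is the common default.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://impala-host:21050/;auth=noSasl");
             Statement stmt = conn.createStatement()) {

            // The same table defined in the Hive metastore is visible to Impala.
            ResultSet rs = stmt.executeQuery("SELECT COUNT(*) FROM web_logs");
            rs.next();
            System.out.println("rows: " + rs.getLong(1));
        }
    }
}
```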
1.1.2.3. Pig
It is a platform for analyzing large data sets that consists of a high-level language
for expressing data analysis programs, coupled with infrastructure for evaluating
these programs. The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables them to handle very large data sets.
The Apache Pig project is a procedural data processing language designed for
Hadoop. It provides an engine for executing data flows in parallel on Hadoop.
More features:
. Pig can operate on data whether it has metadata or not. It can operate on
data that is relational, nested, or unstructured. And it can easily be
extended to operate on data beyond files, including key/value stores,
databases, etc.
. Intended to be a language for parallel data processing. It is not tied to one
particular parallel framework. It has been implemented first on Hadoop,
but it is not intended to be only on Hadoop.
It can also read input from and write output to sources other than HDFS.
. Designed to be easily controlled and modified by its users.
It allows integration of user code wherever possible, so it supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
. Pig processes data quickly.
. It includes a language, Pig Latin, for expressing data flows. Pig Latin use
cases tend to fall into three separate categories: traditional extract
transform load (ETL) data pipelines, research on raw data, and iterative
processing.
Pig Latin includes operators for many of the traditional data operations
(join, sort, filter, etc.), as well as the ability for users to develop their own
functions for reading, processing, and writing data.
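A minimal ETL-style flow is sketched below using the PigServer Java API to run a few lines of Pig Latin (load, group, aggregate, store). The input path and field names are invented for the example.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // Runs Pig Latin on the cluster; use ExecType.LOCAL to test on local files.
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // A tiny pipeline: load tab-separated logs, group by user, count hits.
        pig.registerQuery("logs = LOAD '/data/web_logs' USING PigStorage('\\t') "
                + "AS (user:chararray, url:chararray);");
        pig.registerQuery("by_user = GROUP logs BY user;");
        pig.registerQuery("hits = FOREACH by_user GENERATE group AS user, COUNT(logs) AS n;");
        pig.store("hits", "/data/hits_per_user");
    }
}
```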
1.1.2.4. Cascading
Most real-world Hadoop applications are built of a series of processing steps, and
Cascading lets you define that sort of complex workflow as a program. You lay out
the logical flow of the data pipeline you need, rather than building it explicitly out
of Map-Reduce steps feeding into one another. To use it, you call a Java API,
connecting objects that represent the operations you want to perform into a
graph. The system takes that definition, does some checking and planning, and executes it on a Hadoop cluster. Developers use Cascading to create a .jar file that describes the required processes.
There are a lot of built-in objects for common operations like sorting, grouping,
and joining, and you can write your own objects to run custom processing code.
More features:
. It is simple to build, easy to test, and robust in production.
. It supports optimized joins.
. Parallel running of jobs.
. Creating checkpoints.
. Developers can work in different languages (Java, Ruby, Scala, Clojure).
. Support for TSV, CSV, and custom-delimited text files.
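A rough sketch of the canonical Cascading word count follows, using Cascading 2.x-era class names (exact package locations vary by version). The source and sink paths come from the command line; the planner turns the logical pipeline into one or more MapReduce jobs.

```java
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexSplitGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class CascadingWordCount {
    public static void main(String[] args) {
        // Taps bind the logical pipeline to concrete HDFS paths.
        Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
        Tap sink = new Hfs(new TextLine(), args[1]);

        // The pipeline: split lines into words, group by word, count each group.
        Pipe pipe = new Each("wordcount", new Fields("line"),
                new RegexSplitGenerator(new Fields("word"), "\\s+"));
        pipe = new GroupBy(pipe, new Fields("word"));
        pipe = new Every(pipe, new Count());

        // The planner compiles this graph and runs it on the Hadoop cluster.
        Flow flow = new HadoopFlowConnector(new Properties())
                .connect(source, sink, pipe);
        flow.complete();
    }
}
```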
1.1.2.5. Flume
Flume is a distributed system for collecting log data from many sources,
aggregating it, and writing it to HDFS. It is designed to be reliable and highly
available, while providing a simple, flexible, and intuitive programming model
based on streaming data flows.
Flume maintains a central list of ongoing data flows, stored redundantly in ZooKeeper.
One very common use of Hadoop is taking web server or other logs from a large
number of machines, and periodically processing them to pull out analytics
information. The Flume project is designed to make the data gathering process
easy and scalable, by running agents on the source machines that pass the data
updates to collectors, which then aggregate them into large chunks that can be
efficiently written as HDFS files. It’s usually set up using a command-line tool that
supports common operations, like tailing a file or listening on a network socket,
and has tunable reliability guarantees that let you trade off performance and the
potential for data loss.
More features:
. Reliability (the ability to continue delivering events in the face of failures
without losing data). Flume can guarantee that all data received by an
agent node will eventually make it to the collector at the end of its flow as
long as the agent node keeps running. That is, data can be reliably
delivered to its eventual destination. Flume allows the user to specify, on a
per-flow basis, the level of reliability required. There are three supported
reliability levels: end-to-end, store on failure, best effort.
. Scalability (the ability to increase system performance linearly by adding
more resources to the system). A key performance measure in Flume is the
number or size of events entering the system and being delivered. When
load increases, it is simple to add more resources to the system in the form
of more machines to handle the increased load.
. Manageability (the ability to control data flows, monitor nodes, modify
settings, and control outputs of a large system). The Flume Master is the
point where global state such as the data flows can be managed. Via the
Flume Master, users can monitor flows and reconfigure them on the fly.
. Extensibility (the ability to add new functionality to a system). For example,
you can extend Flume by adding connectors to existing storage layers or
data platforms. This is made possible by simple interfaces, separation of
functional concerns into simple pieces, a flow specification language, and a
simple but flexible data model. Flume provides many common input and
output connectors.
1.1.2.6. Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately,
Hadoop is hard to use for this purpose. Writing MapReduce jobs to process logs is
somewhat tedious and the batch nature of MapReduce makes it difficult to use
with logs that are generated incrementally across many machines. Furthermore,
HDFS still does not support appending to existing files. Chukwa is a Hadoop
subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analyzing log-based data. Its durability features include agent-side replaying of data to recover from errors.
. Collection components of Chukwa: adaptors, agents (that run on each
machine and emit data), and collectors (that receive data from the agent
and write to a stable storage).
. Chukwa includes Hadoop Infrastructure Care Center (HICC), which is a web
interface for visualizing system performance.
. Flexible and powerful toolkit for displaying, monitoring and analyzing
results to make the best use of the collected data.
. Chukwa's reliability model supports two levels: end-to-end reliability, and fast-path delivery, which minimizes latency. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams.
1.1.2.7. Sqoop
It is an open-source tool that allows users to extract data from a relational
database into Hadoop for further processing. This processing can be done with
MapReduce programs or other higher-level tools such as Hive. (It’s even possible
to use Sqoop to move data from a relational database into HBase.) When the final
results of an analytic pipeline are available, Sqoop can export these results back to
the database for consumption by other clients.
More features:
. Bulk import. Sqoop can import individual tables or entire databases into
HDFS. The data is stored in the native directories and files in the HDFS file
system.
. Direct input. Sqoop can import and map SQL (relational) databases directly
into Hive and HBase.
. Data interaction. Sqoop can generate Java classes so that you can interact
with the data programmatically.
. Data export. Sqoop can export data directly from HDFS into a relational
database using a target table definition based on the specifics of the target
database.
. It integrates with Oozie.
. It is a command-line tool (a typical import/export pair is shown below).
. It comes complete with connectors to MySQL, PostgreSQL, Oracle, SQL
Server and DB2.
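As an illustration of that command-line interface, an import/export pair might look like the following sketch. The connection string, credentials, table names and HDFS paths are placeholders.

```bash
# Import a relational table into HDFS as files under the target directory
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

# Export the results of an analytic pipeline back to the database
sqoop export \
  --connect jdbc:mysql://db.example.com/sales \
  --username analyst -P \
  --table order_stats \
  --export-dir /data/order_stats
```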
1.1.2.8. Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs.
An Oozie workflow is a collection of actions (e.g. Hadoop MapReduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) that specifies a sequence of action executions. This graph is written in hPDL (an XML Process Definition Language).
More features:
. Oozie is a scalable, reliable and extensible system.
. Oozie can detect completion of computation/processing tasks by two
different means, callbacks and polling.
. Some workflows are invoked on demand, but most need to run at regular time intervals and/or based on data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters.
. It can run jobs sequentially (one after the other) and in parallel (multiple at a time).
. Oozie can also run plain java classes, Pig workflows, and interact with the
HDFS.
. Oozie provides major flexibility (start, stop, suspend and re-run jobs).
It allows you to restart from a failure (you can tell Oozie to restart a job
from a specific node in the graph or to skip specific failed nodes).
. Java Client API / Command Line Interface (launch, control and monitor jobs
from your Java Apps).
. Web Service API (you can control jobs from anywhere).
. Receive an email when a job is complete.
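A minimal sketch of the Oozie Java client API, submitting a workflow and polling it until it finishes. The server URL, application path and job properties are placeholders; the workflow definition itself is assumed to already sit in HDFS.

```java
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // The Oozie server URL and HDFS application path are placeholders.
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/demo/my-workflow");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "jobtracker:8032");

        // Submit and start the workflow, then poll its status.
        String jobId = oozie.run(conf);
        while (oozie.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
            Thread.sleep(10 * 1000);
        }
        System.out.println("final status: " + oozie.getJobInfo(jobId).getStatus());
    }
}
```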
1.1.2.9. HCatalog
HCatalog is an abstraction for data storage and a metadata service.
It provides a set of interfaces that open up access to Hive's metastore for tools
inside and outside of the Hadoop grid.
More features:
. It presents users with a table abstraction. This frees them from knowing
where or how their data are stored.
. It allows data producers to change how they write data while still
supporting existing data in the old format so that data consumers do not
have to change their processes.
. It provides a shared schema and data model for Pig, Hive, and MapReduce.
. It provides interoperability across data processing tools such as Pig, Map
Reduce, and Hive.
. A REST interface to allow language independent access to Hive's metadata.
. HCatalog includes Hive's command-line interface, so that administrators can create and drop tables, specify table parameters, etc.
. It also provides an API for storage format developers to tell HCatalog how
to read and write data stored in different formats.
. It supports RCFile (Record Columnar File), CSV (Comma Separated Values),
JSON (JavaScript Object Notation), and SequenceFile formats.
. The data model of HCatalog is similar to HBase’s data model.
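As a sketch of the shared table abstraction, a Pig script can load a Hive-defined table by name through HCatLoader, with no path or schema in the script. The table and column below are hypothetical, and the loader's package has moved between releases (older versions use org.apache.hcatalog.pig.HCatLoader).

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class HCatalogFromPig {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // No path and no schema in the script: HCatalog resolves 'web_logs'
        // (a hypothetical Hive-defined table) to its location and column types.
        pig.registerQuery("logs = LOAD 'web_logs' "
                + "USING org.apache.hive.hcatalog.pig.HCatLoader();");
        pig.registerQuery("recent = FILTER logs BY dt == '2015-05-21';");
        pig.store("recent", "/tmp/recent_logs");
    }
}
```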
Hive
Advantages:
. Open source
. Easy data summarization
. Ad-hoc queries
. Provides a Hadoop query language (HiveQL), similar to SQL
. Metadata store, which makes lookups easy
Disadvantages:
. It is not for OLTP processing
. Data is required to be loaded from a file

Impala
Advantages:
. Open source
. SQL operations on top of Hadoop
. Useful with HBase, Hive, Pig
. Query results faster than Hive
Disadvantages:
. Not all of Hive's SQL is supported
. You cannot create or modify a table

Pig
Advantages:
. Open source
. Very quick for processing large stable datasets such as meteorological trends or web-server logs
. Perfect for data processing that involves a number of steps (a pipeline of processing)
. Ideal for solving problems that can be carved up, analyzed in pieces in parallel and then put back together (text mining, sentiment trends, recommendation, pattern recognition)
. Pig makes it simple to build scripts to analyze data, experimenting with approaches to identify the best one
. It resides on the user machine; it is not necessary to install anything in the Hadoop cluster
Disadvantages:
. It is not ideal for real-time or near real-time processing

Cascading
Advantages:
. Open source
. There are a lot of pre-built components that can be composed together
. Very custom operations can be written as straight Java functions
. It allows you to write analytics jobs quickly and easily in a familiar language
Disadvantages:
. It is not the best fit for some fine-grained, performance-critical problems

Flume
Advantages:
. Open source
. Scalable
. Solution for data collection of all forms
. Possible sources for Flume include Avro files and system logs
. It has a query processing engine
. It allows streaming data to be managed and captured into Hadoop
Disadvantages:
. It does not do real-time analytics

Chukwa
Advantages:
. Open source
. Scalable
. Comprehensive toolset for log analysis
. It has a rich metadata model
. It can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog
. It provides a framework for processing the collected data
Disadvantages:
. Chukwa works with an agent-collector setup that works predominantly with a single collector unless configured for a multi-collector setup
. It does not have any support for a gzip feature to zip the data files before or after storing data in HDFS

Sqoop
Advantages:
. Open source
. It is extensible; a number of third-party companies ship database-specific connectors
. Connectors register metadata (Sqoop 2)
. Admins set policy for connection use (Sqoop 2)
. It is compatible with almost any JDBC-enabled database
. Integration with Hive and HBase
Disadvantages:
. Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or partition

Oozie
Advantages:
. It supports: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, and sub-workflow actions
. It supports variables and functions
. Interval job scheduling is time- and input-data-dependent
Disadvantages:
. All job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library, which makes it more difficult to use

HCatalog
Advantages:
. It provides a shared schema and data model for Pig, Hive, and MapReduce
Disadvantages:
. None found

Table 2: MapReduce
1.1.3. Machine learning
Machine learning is a branch of artificial intelligence that concerns the construction
and study of systems that can learn from data.
For example, a machine learning system could be trained on email messages to
learn to distinguish between spam and non-spam messages. After learning, it can
then be used to classify new email messages into spam and non-spam folders.
The core of machine learning deals with representation and generalization. Generalization is the ability of a learning machine to perform accurately on new, unseen examples or tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model of this space that enables it to produce sufficiently accurate predictions in previously unseen cases.
Machine learning focuses on prediction, based on known properties learned from
the training data.
1.1.3.1. WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It
provides a plug-in architecture for researchers to add their own techniques, with a
command-line and window interface that makes it easy to apply them to your own
data. You can use it to do everything from basic clustering to advanced
classification, together with a lot of tools for visualizing your results.
It is heavily used as a teaching tool, but it also comes in extremely handy for
prototyping and experimenting outside of the classroom.
It has a strong set of preprocessing tools that make it easy to load your data in,
and then you have a large library of algorithms at your fingertips, so you can
quickly try out ideas until you find an approach that works for your problem.
The command-line interface allows you to apply exactly the same code in an
automated way for production.
More features:
. WEKA includes data preprocessing tools.
. Classification/regression algorithms.
. Clustering algorithms.
. Attribute/subset evaluators and search algorithms for feature selection.
. Algorithms for finding association rules.
. Graphical user interfaces: the Explorer (exploratory data analysis), the Experimenter (experimental environment), and the Knowledge Flow (a process-model-inspired interface).
. WEKA is platform-independent.
. It is easily usable by people who are not data mining specialists.
. Provides flexible facilities for scripting experiments.
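A minimal sketch of the WEKA Java API: load an ARFF dataset, train a J48 decision tree, and evaluate it with 10-fold cross-validation. The file name is a placeholder; the class attribute is assumed to be the last one.

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaExample {
    public static void main(String[] args) throws Exception {
        // Load an ARFF dataset (path is a placeholder) and mark the class attribute.
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Train a J48 decision tree and cross-validate it (10 folds, fixed seed).
        J48 tree = new J48();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new java.util.Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```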
1.1.3.2. Mahout
Mahout is an open source machine learning library from Apache. Its focus is primarily on recommender engines (collaborative filtering), clustering, and classification.
Mahout aims to be the machine learning tool of choice when the collection of data
to be processed is very large, perhaps far too large for a single machine.
It’s a framework of tools intended to be used and adapted by developers. In
practical terms, the framework makes it easy to use analysis techniques to
implement features such as Amazon’s “People who bought this also bought”
recommendation engine on your own site.
More features:
. Mahout is scalable.
. It supports algorithms for recommendation: for example, it takes users' behavior and from that tries to find items users might like (see the sketch below).
. Algorithms for clustering: it takes, for example, text documents and groups them into clusters of topically related documents.
. Algorithms for classification: it learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the correct category.
. Algorithms for frequent itemset mining: it takes a set of item groups (terms in a query session, shopping cart contents) and identifies which individual items usually appear together.
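A minimal sketch of a user-based recommender with Mahout's Taste API. The ratings file (lines of userID,itemID,preference), the neighborhood size and the user ID are placeholders for the example.

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
    public static void main(String[] args) throws Exception {
        // ratings.csv (placeholder) holds lines of: userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // "People like you": similar users, a neighborhood, a recommender.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommended items for user 1.
        List<RecommendedItem> recs = recommender.recommend(1, 3);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " " + item.getValue());
        }
    }
}
```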
WEKA
Advantages:
. Free availability under the GNU General Public License
. Portability, since it is fully implemented in Java
. Ease of use due to its graphical user interfaces
. It provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query
Disadvantages:
. It is not capable of multi-relational data mining
. Sequence modeling is not covered
. Experiments involving very large data quantities (millions of instances) can take a long time to process

Mahout
Advantages:
. Open source
. Scalable
. It can process very large data quantities
. It has functionality for many of today's common machine learning tasks
Disadvantages:
. Mahout is merely a library of algorithms, not a finished product

Table 3: Machine learning
1.1.4. Visualization
Visualization tools let you gain deeper insights from data stored in Hadoop. Including these tools in an analysis reveals patterns and associations that would otherwise be missed.
1.1.4.1. Fusion Tables
Google has created an integrated online system that lets you store large amounts
of data in spreadsheet-like tables and gives you tools to process and visualize the
information. It’s particularly good at turning geographic data into compelling
maps, with the ability to upload your own custom KML (XML notation for
expressing geographic annotation and visualization within Internet-based, two-
dimensional maps and three-dimensional Earth browsers) outlines for areas like
political constituencies. There is also a full set of traditional graphing tools, as well
as a wide variety of options to perform calculations on your data.
Fusion Tables is a powerful system, but it’s definitely aimed at fairly technical
users; the sheer variety of controls can be intimidating at first. If you’re looking for
a flexible tool to make sense of large amounts of data, it’s worth making the
effort.
More features:
. Fusion Tables is an experimental data visualization web application to gather, visualize, and share large data tables.
. Fusion Tables lets you visualize big table data online: filter and summarize across hundreds of thousands of rows, then try a chart, map, network graph, or custom layout, and embed or share it. Merge two or three tables to generate a single visualization.
. Combine your data with other data on the web.
. Make a map in minutes.
. Host data online.
1.1.4.2. Tableau
Originally a traditional desktop application for drawing graphs and visualizations,
Tableau has been adding a lot of support for online publishing and content
creation. Its embedded graphs have become very popular with news organizations
on the Web, illustrating a lot of stories.
The support for geographic data isn’t as extensive as Fusion’s, but Tableau is
capable of creating some map styles that Google’s product can’t produce.
More features:
. With Tableau Public, interactive visuals can be created and published without the help of programmers.
. It offers hundreds of visualization types, such as maps, bar and line charts,
lists, and heat maps.
. Tableau Public is automatically touch-optimized for Android and iPad
tablets. It supports all browsers without plug-ins.
Fusion Tables
Advantages:
. Good at turning geographic data into compelling maps, with the ability to upload your own custom KML
. They offer spatial query processing and very thorough Google Maps integration
Disadvantages:
. Access must be authenticated
. There is no organization to datasets

Tableau
Advantages:
. It brings in data fast thanks to its in-memory analytical engine
. It has native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks and the MapR Hadoop distribution for Hadoop reporting and analysis
. It has powerful visualization capabilities that let you create maps, charts and dashboards easily
Disadvantages:
. It is not open source

Table 4: Visualization
1.1.5. Search
Search is well suited to leverage a lot of different types of information, especially
unstructured information.
One of the first things any organization is going to want to do once it accumulates a
mass of Big Data is search it.
1.1.5.1. Lucene
Lucene is a Java-based search library. It has an architecture that employs best
practice relevancy ranking and querying, as well as state of the art text
compression and a partitioned index strategy to optimize both query performance
and indexing flexibility.
More features:
. Speed — sub-second query performance for most queries.
. Complete query capabilities: keyword, Boolean and +/- queries, proximity
operators, wildcards, fielded searching, term/field/document weights,
find-similar, spell-checking, multi-lingual search and others.
. Full results processing, including sorting by relevancy, date or any field,
dynamic summaries and hit highlighting.
. Portability: runs on any platform supporting Java, and indexes are portable
across platforms – you can build an index on Linux and copy it to a
Microsoft Windows machine and search it there.
. Scalability — there are production applications in the hundreds of millions
and billions of documents/records.
. Low overhead indexes and rapid incremental indexing.
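A minimal sketch of indexing and searching with the Lucene Java API follows. Constructor signatures vary across Lucene versions (4.x requires a Version argument); this follows the 5.x style and uses an in-memory index, with invented field names and content.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class LuceneExample {
    public static void main(String[] args) throws Exception {
        Directory dir = new RAMDirectory(); // in-memory index for the demo
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Index one document with a single analyzed, stored text field.
        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
        Document doc = new Document();
        doc.add(new TextField("body", "Hadoop ecosystem search with Lucene",
                Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // Parse a query against the "body" field and fetch the top hits.
        IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
        TopDocs hits = searcher.search(
                new QueryParser("body", analyzer).parse("lucene"), 10);
        for (ScoreDoc sd : hits.scoreDocs) {
            System.out.println(searcher.doc(sd.doc).get("body"));
        }
    }
}
```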
1.1.5.2. Solr
Solr is a standalone enterprise search server with a REST-like API. You put
documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You
query it via HTTP GET and receive XML, JSON, CSV or binary results.
Solr is highly scalable, providing distributed search and index replication.
More features:
. Advanced full-text search capabilities.
. Optimized for high volume web traffic.
. Standards based open interfaces - XML, JSON and HTTP.
. Comprehensive HTML administration interfaces.
. Server statistics exposed over JMX for monitoring.
. Linearly scalable, auto index replication, auto failover and recovery.
. Near real-time indexing.
. Flexible and adaptable with XML configuration.
. Extensible plugin architecture.
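A minimal sketch using the SolrJ client to index a document and query it over HTTP. The core URL and field names are placeholders, and the fields are assumed to exist in the core's schema (in SolrJ 4.x the client class is HttpSolrServer rather than HttpSolrClient).

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrExample {
    public static void main(String[] args) throws Exception {
        // URL of an assumed core named "articles" on a local Solr server.
        HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/articles");

        // Index one document, then commit to make it searchable.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Big Data - Hadoop Ecosystem");
        solr.add(doc);
        solr.commit();

        // Query over HTTP GET and iterate the results.
        QueryResponse rsp = solr.query(new SolrQuery("title:hadoop"));
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("title"));
        }
        solr.close();
    }
}
```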
Lucene
Advantages:
. It is the core search library (a library for indexing and searching text)
Disadvantages:
. ACID (or near-ACID) behavior is not guaranteed; a crash while writing to a Lucene index might render it useless

Solr
Advantages:
. It is the logical starting point for developers building search applications
. It is good at reads
Disadvantages:
. Updates replace whole documents rather than individual fields (so when you have a million documents that say "German" and should say "French", you have to reindex a million documents)
. It takes too long to update and commit

Table 5: Search
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfMarinCaroMartnezBerg
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxTanveerAhmed817946
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 

Recently uploaded (20)

VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一定制英国白金汉大学毕业证(UCB毕业证书)																			成绩单原版一比一
定制英国白金汉大学毕业证(UCB毕业证书) 成绩单原版一比一
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
FESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdfFESE Capital Markets Fact Sheet 2024 Q1.pdf
FESE Capital Markets Fact Sheet 2024 Q1.pdf
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Digi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptxDigi Khata Problem along complete plan.pptx
Digi Khata Problem along complete plan.pptx
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 

Big Data - Hadoop Ecosystem

Revision History

Date        Version     Description     Author
            0.0                         Nuria de las Heras
A. Framework Ecosystem – Hadoop Ecosystem

The Hadoop platform consists of two key services: a reliable, distributed file system called the Hadoop Distributed File System (HDFS) and a high-performance parallel data processing engine called Hadoop MapReduce. The combination of HDFS and MapReduce provides a software framework for processing vast amounts of data in parallel on large clusters of commodity hardware (potentially scaling to thousands of nodes) in a reliable, fault-tolerant manner. Hadoop is a generic processing framework designed to execute queries and other batch read operations against massive datasets that can scale from tens of terabytes to petabytes in size.

When Hadoop 1.0.0 was released by Apache in 2011, comprising mainly HDFS and MapReduce, it soon became clear that Hadoop was not simply another application or service, but a platform around which an entire ecosystem of capabilities could be built. Since then, dozens of self-standing software projects have sprung into being around Hadoop, each addressing a variety of problem spaces and meeting different needs. The so-called "Hadoop ecosystem" is, as befits an ecosystem, complex, evolving, and not easily parceled into neat categories.

Simply keeping track of all the project names may seem like a task of its own, but this pales in comparison to the task of tracking the functional and architectural differences between projects. These projects are not meant to all be used together, as parts of a single organism; some may even be seeking to solve the same problem in different ways. What unites them is that they each seek to tap into the scalability and power of Hadoop, particularly the HDFS component of Hadoop.
Figure 1: Hadoop Ecosystem

1.1. Tools for working with Hadoop

1.1.1. No SQL Databases
Next-generation databases mostly address some of the following points: being non-relational, distributed, open source, and horizontally scalable. The original intention was to support modern web-scale databases. The movement began in early 2009 and is growing rapidly. Often further characteristics apply, such as: schema-free design, easy replication support, a simple API, eventual consistency / BASE (Basically Available, Soft state, Eventually consistent, as opposed to ACID), the ability to handle huge amounts of data, and more. So the misleading term "NoSQL" (which the community now mostly reads as "not only SQL") should be seen as an alias for something like the definition above.
Figure 2: No SQL Databases Ecosystem

1.1.1.1. MongoDB
It is a document-oriented system, with records that look similar to JSON objects and the ability to store and query on nested attributes.
More features:
. MongoDB is written in C++.
. It is document-oriented storage. Documents are assumed to encapsulate and encode data in some standard format or encoding. Encodings in use include XML, YAML and JSON (JavaScript Object Notation), as well as binary forms like BSON, PDF and MS Office documents.
. Documents use BSON syntax. Data is stored and queried in BSON; think binary-serialized JSON-like data.
. MongoDB uses collections for storing groups of data. Documents exist inside a collection.
. Documents are schema-less. Data in MongoDB has a flexible schema; collections do not enforce a document structure.
. MongoDB supports indexes on any attribute, which provides high-performance read operations for frequently used queries.
. It supports replication and high availability, which means mirroring across LANs and WANs. Replica sets provide redundancy and high availability.
. Auto-sharding. Sharding (the process of storing data records across multiple machines) solves the problem of horizontal scaling: you add more machines to support data growth and the demand of read and write operations.
. Querying supports rich, document-based queries.
. It provides methods to perform update operations.
. Flexible aggregation and data processing. Map-reduce operations can handle complex aggregation tasks.
. It stores files of any size. GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16 MB.
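As a minimal sketch of the document model and nested-attribute queries described above, using the MongoDB Java driver (3.x style); the database, collection, and field names are illustrative, not part of the original text:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;
    import static com.mongodb.client.model.Filters.eq;

    public class MongoExample {
        public static void main(String[] args) {
            MongoClient client = new MongoClient("localhost", 27017);
            MongoCollection<Document> users = client.getDatabase("test").getCollection("users");

            // Documents are schema-less and may nest sub-documents.
            users.insertOne(new Document("name", "Ada")
                    .append("address", new Document("city", "London")));

            // Query on a nested attribute using dot notation.
            Document found = users.find(eq("address.city", "London")).first();
            System.out.println(found.toJson());
            client.close();
        }
    }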
1.1.1.2. Cassandra
Cassandra is an open-source distributed database management system designed to handle large amounts of data across many servers, providing high availability with no single point of failure. It offers robust support for clusters spanning multiple datacenters, with asynchronous masterless replication allowing low-latency operations for all clients.
More features:
. Cassandra is written in Java.
. Decentralized. Every node in the cluster has the same role. There is no single point of failure. Data is distributed across the cluster (so each node contains different data), but there is no master, as every node can service any request.
. Scalability. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
. Fault-tolerant. Data is automatically replicated to multiple nodes for fault tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
. Tunable consistency. Cassandra's data model is a partitioned row store with tunable consistency. For any given read or write operation, the client application decides how consistent the requested data should be.
. MapReduce support. Cassandra has Hadoop integration, with MapReduce support. There is also support for Apache Pig and Apache Hive.
. Query language. CQL (Cassandra Query Language) is a SQL-like alternative to the traditional RPC interface. Language drivers are available for Java (JDBC), Python (DBAPI2) and Node.js (Helenus).
. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns may be indexed separately from the primary key.
. Cassandra is frequently referred to as a "column-oriented" database. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and timestamp. Different rows in the same column family do not have to share the same set of columns, and a column may be added to one or multiple rows at any time.
. It does not support joins or subqueries, except for batch analysis via Hadoop.
. It is not relational; it represents its data structures as sparse, multidimensional hash tables.
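A short CQL sketch of the primary-key layout described above (the table and column names are illustrative): the first key component is the partition key, and the second clusters rows within the partition.

    CREATE TABLE clicks (
        user_id  text,       -- partition key: decides which node stores the row
        ts       timestamp,  -- clustering column: orders rows inside the partition
        url      text,
        PRIMARY KEY (user_id, ts)
    );

    INSERT INTO clicks (user_id, ts, url)
    VALUES ('ada', '2015-05-21 10:00:00', '/index.html');

    SELECT url FROM clicks WHERE user_id = 'ada';  -- touches a single partition

Because the partition key routes each row to a node, single-partition reads like the SELECT above stay fast regardless of cluster size; there is no join to express, in line with the limitations listed above.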
1.1.1.3. HBase
It is a distributed, column-oriented database built on top of HDFS, providing Bigtable-like capabilities for Hadoop. It has been designed from the ground up with a focus on scale in every direction: tall in number of rows (billions) and wide in number of columns (millions). HBase is at its best when it is accessed in a distributed fashion by many clients. HBase is recommended when you need random, real-time read/write access to Big Data.
More features:
. Written in Java.
. Strongly consistent reads/writes. This makes it very suitable for tasks such as high-speed counter aggregation.
. Automatic sharding. HBase tables are distributed on the cluster via regions, and regions are automatically split and redistributed as your data grows.
. Automatic RegionServer failover.
. In the parlance of the CAP theorem, HBase is a CP (consistency and partition tolerance) type system.
. HBase is not relational and does not support SQL.
. It depends on ZooKeeper, and by default it manages a ZooKeeper instance as the authority on cluster state.
. MapReduce. HBase supports massively parallelized processing via MapReduce, using HBase as both source and sink.
. Java client API. HBase provides an easy-to-use Java API for programmatic access. Tables can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through REST, Avro or Thrift gateway APIs.
. Operational management. HBase provides built-in web pages for operational insight, as well as JMX metrics.
. Block Cache (an LRU cache that contains three levels of block priority) and Bloom Filters (a data structure designed to tell you, rapidly and memory-efficiently, whether an element is present in a set). HBase supports a Block Cache and Bloom Filters for high-volume query optimization.
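A minimal sketch of the Java client API mentioned above, in the HBase 1.x style; the table name, column family, and values are illustrative, and the table is assumed to already exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("clicks"))) {

                // Write one cell: row key, column family, qualifier, value.
                Put put = new Put(Bytes.toBytes("user-ada"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("url"),
                        Bytes.toBytes("/index.html"));
                table.put(put);

                // Random, real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("user-ada")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("url"))));
            }
        }
    }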
1.1.1.4. Zookeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications.
More features:
. It allows distributed processes to coordinate with each other through a shared hierarchical namespace, which is organized similarly to a standard file system. The namespace consists of data registers (called znodes, in ZooKeeper parlance), and these are similar to files and directories. Unlike a typical file system, which is designed for storage, ZooKeeper data is kept in-memory, which means ZooKeeper can achieve high throughput and low latency.
. The performance aspects of ZooKeeper mean it can be used in large, distributed systems.
. The reliability aspects keep it from being a single point of failure.
. The servers that make up the ZooKeeper service must all know about each other. They maintain an in-memory image of state, along with transaction logs and snapshots in a persistent store. As long as a majority of the servers are available, the ZooKeeper service will be available.
. ZooKeeper stamps each update with a number that reflects the order of all ZooKeeper transactions.
. It is especially fast in "read-dominant" workloads. ZooKeeper applications run on thousands of machines, and it performs best where reads are more common than writes, at ratios of around 10:1.
. It provides sequential consistency. Updates from a client will be applied in the order that they were sent.
. Atomicity. Updates either succeed or fail; there are no partial results.
. Single system image. A client will see the same view of the service regardless of the server it connects to.
. Reliability. Once an update has been applied, it will persist from that time forward until a client overwrites it.
. Timeliness. The client's view of the system is guaranteed to be up-to-date within a certain time bound.
. It provides a very simple programming interface.
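A minimal sketch of that programming interface using the ZooKeeper Java client (the connection string and znode path are illustrative): create a znode holding a piece of configuration, then read it back.

    import org.apache.zookeeper.*;

    public class ZkExample {
        public static void main(String[] args) throws Exception {
            // Connect with a 3-second session timeout; the watcher ignores events here.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> { });

            // Create a persistent znode holding configuration data.
            zk.create("/app-config", "maxWorkers=8".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Single system image: any server in the ensemble returns the same view.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));
            zk.close();
        }
    }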
MongoDB
Advantages:
. Open source
. Easy to install
. Scalable
. High performance
. Schema-free
. Dynamic queries supported
Disadvantages:
. Higher chance of losing data when adapting content, and it is hard to retrieve it
. Tops out performance-wise at relatively small data volumes

Cassandra
Advantages:
. Open source
. Scalable
. High-level redundancy, failover and backup-restore capabilities
. It has no single point of failure
. Ability to open and deliver data in near real-time
. Supports interactive web-based applications
Disadvantages:
. Complex administering and managing
. Although it supports indexes, they can get out of sync with the data because of the lack of transactions
. It has no joins
. It is not suitable for large blobs

HBase
Advantages:
. Open source
. Scalable
. Good solution for large-scale data processing and analysis
. Strongly consistent reads and writes
. High write performance
. Automatic failover support between Region Servers
Disadvantages:
. Management complexity
. Needs ZooKeeper
. The HDFS NameNode and HBase Master are SPOFs (Single Points of Failure)

Zookeeper
Advantages:
. Open source
. High performance
. Good process synchronization in the cluster
. Consistency of the configuration in the cluster
. Reliable messaging in the cluster
Disadvantages:
. Clients need to keep sending heartbeat messages in the absence of activity
. ZooKeeper cannot make partial failures go away, since they are intrinsic to distributed systems

Table 1: No SQL Databases
1.1.2. Map - Reduce
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. Every job in MapReduce consists of three main phases: map, shuffle, and reduce.

In the map phase, the application has the opportunity to operate on each record in the input separately. Many maps are started at once, so that while the input may be gigabytes or terabytes in size, given enough machines the map phase can usually be completed in less than one minute. For example, if you were processing web server logs for a website that required users to log in, you might choose the user ID as your key so that you could see everything done by each user on your website.

In the shuffle phase, which happens after the map phase, data is collected together by the key the user has chosen and distributed to different machines for the reduce phase. Every record for a given key will go to the same reducer.

In the reduce phase, the application is presented with each key, together with all of the records containing that key. Again, this is done in parallel on many machines. After processing each group, the reducer can write its output. A minimal sketch of the web-log example is given after the feature list below.
More features:
. Scale-out architecture. Add servers to increase processing power.
. Security and authentication. Works with HDFS and HBase security to make sure that only approved users can operate against the data in the system.
. Resource manager. Employs data locality and server resources to determine optimal computing operations.
. Optimized scheduling. Completes jobs according to prioritization.
. Flexibility. Procedures can be written in virtually any programming language.
. Resiliency and high availability. Multiple job and task trackers ensure that jobs fail independently and restart automatically.
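The sketch below expresses the web-log example in the Hadoop Java API: the mapper emits the user ID as the key, the shuffle groups all records per user, and the reducer counts them. The log format (tab-separated, user ID first) and class names are illustrative assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.*;
    import org.apache.hadoop.mapreduce.*;

    public class UserActivityCount {
        // Map phase: runs on each input record independently.
        public static class LogMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                // Assume the user ID is the first tab-separated field.
                String userId = line.toString().split("\t")[0];
                ctx.write(new Text(userId), ONE);  // the user ID becomes the shuffle key
            }
        }

        // Reduce phase: receives one user ID together with all of its records.
        public static class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text userId, Iterable<IntWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable c : counts) total += c.get();
                ctx.write(userId, new IntWritable(total));
            }
        }
    }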
Figure 3: Map – Reduce Ecosystem

1.1.2.1. Hive
Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Because of Hadoop's focus on large-scale processing, the latency may mean that even simple jobs take minutes to complete, so it is not a substitute for a real-time transactional database.
More features:
. Scalability. Scale out with more machines added dynamically to the Hadoop cluster.
. It provides tools to enable easy data ETL.
. Indexing to provide acceleration; index types include compaction and bitmap indexes.
. Different storage types such as plain text, RCFile, HBase, ORC, and others.
. Metadata storage in an RDBMS, significantly reducing the time to perform semantic checks during query execution.
. Operating on compressed data stored in the Hadoop ecosystem, with algorithms including gzip, bzip2, snappy, and others.
. SQL-like queries (HiveQL), which are implicitly converted into map-reduce jobs.
. Built-in user-defined functions (UDFs) to manipulate dates, strings, and other data-mining tools. Hive supports extending the UDF set to handle use cases not supported by built-in functions.
. Hive also provides query execution via MapReduce. It allows map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
. Hive is not designed for OLTP workloads.
. It does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).
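A small HiveQL sketch of the append-only web-log use case named above (the table, column names, and path are illustrative); the SELECT is implicitly compiled into MapReduce jobs:

    CREATE TABLE weblogs (user_id STRING, url STRING, ts STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    LOAD DATA INPATH '/logs/2015-05-21' INTO TABLE weblogs;

    -- Batch summarization over the whole dataset.
    SELECT user_id, COUNT(*) AS hits
    FROM weblogs
    GROUP BY user_id;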
1.1.2.2. Impala
Impala is an open-source, interactive/real-time SQL query system that runs on HDFS. As Impala supports SQL and provides real-time big data processing functionality, it has the potential to be used as a business intelligence (BI) system.

Impala was technically inspired by Google's Dremel paper. Dremel is a scalable, interactive ad-hoc query system for the analysis of read-only nested data. By combining multi-level execution trees and a columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds, and the system scales to thousands of CPUs and petabytes of data.

The difference between Impala and Hive is whether the system is real-time or not. While Hive uses MapReduce for data access, Impala uses its own distributed query engine to minimize response time. This distributed query engine is installed on all data nodes in the cluster.
More features:
. Nearly all of Hive's SQL, including insert, join and subqueries.
. Query results faster than Hive.
. Easy to create and change schemas.
. Tables created with Hive can be queried with Impala.
. Support for a variety of data formats: Hadoop-native (Apache Avro, SequenceFile, RCFile with Snappy, GZIP, BZIP, or uncompressed); text (uncompressed or LZO-compressed); and Parquet (Snappy or uncompressed), the new state-of-the-art columnar storage format.
. Connectivity via JDBC, ODBC, the Hue GUI, or a command-line shell.
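Because Hive-created tables can be queried with Impala, the weblogs table from the Hive sketch above can be re-queried interactively. A sketch using Impala's command-line shell (the impalad host name is illustrative); the result comes back from the distributed query engine rather than from a launched MapReduce job:

    impala-shell -i datanode1 \
        -q "SELECT user_id, COUNT(*) AS hits FROM weblogs GROUP BY user_id"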
1.1.2.3. Pig
It is a platform for analyzing large data sets, consisting of a high-level language for expressing data analysis programs coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

The Apache Pig project provides a procedural data processing language designed for Hadoop, with an engine for executing data flows in parallel on Hadoop.
More features:
. Pig can operate on data whether it has metadata or not. It can operate on data that is relational, nested, or unstructured. And it can easily be extended to operate on data beyond files, including key/value stores, databases, etc.
. Intended to be a language for parallel data processing. It is not tied to one particular parallel framework: it was implemented first on Hadoop, but it is not intended to run only on Hadoop. It can also read input from, and write output to, sources other than HDFS.
. Designed to be easily controlled and modified by its users. It allows integration of user code wherever possible, so it supports user-defined field transformation functions, user-defined aggregates, and user-defined conditionals.
. Pig processes data quickly.
. It includes a language, Pig Latin, for expressing data flows. Pig Latin use cases tend to fall into three separate categories: traditional extract-transform-load (ETL) data pipelines, research on raw data, and iterative processing. Pig Latin includes operators for many of the traditional data operations (join, sort, filter, etc.), as well as the ability for users to develop their own functions for reading, processing, and writing data.
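A minimal Pig Latin sketch of an ETL-style flow (the paths and field names are illustrative); each statement describes one step of the data flow, which Pig compiles into parallel jobs:

    logs    = LOAD '/logs/2015-05-21' USING PigStorage('\t')
              AS (user_id:chararray, url:chararray);
    valid   = FILTER logs BY user_id IS NOT NULL;
    grouped = GROUP valid BY user_id;
    hits    = FOREACH grouped GENERATE group AS user_id, COUNT(valid) AS hits;
    STORE hits INTO '/output/user_hits';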
1.1.2.4. Cascading
Most real-world Hadoop applications are built of a series of processing steps, and Cascading lets you define that sort of complex workflow as a program. You lay out the logical flow of the data pipeline you need, rather than building it explicitly out of Map-Reduce steps feeding into one another. To use it, you call a Java API, connecting objects that represent the operations you want to perform into a graph. The system takes that definition, does some checking and planning, and executes it on the Hadoop cluster. Developers use Cascading to create a .jar file that describes the required processes. There are a lot of built-in objects for common operations like sorting, grouping, and joining, and you can write your own objects to run custom processing code.
More features:
. It is simple to build, easy to test, and robust in production.
. It supports optimized joins.
. Parallel running jobs.
. Creating checkpoints.
. Developers can work in different languages (Java, Ruby, Scala, Clojure).
. Support for TSV, CSV, and custom-delimited text files.
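A minimal sketch in the Cascading 2.x Java API style (paths and field names are illustrative): taps represent sources and sinks, pipes represent the logical flow, and the planner turns the connected graph into MapReduce jobs.

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class CascadingExample {
        public static void main(String[] args) {
            // Source and sink taps over tab-delimited files in HDFS.
            Tap in  = new Hfs(new TextDelimited(new Fields("user_id", "url"), "\t"), "/logs/in");
            Tap out = new Hfs(new TextDelimited(new Fields("user_id", "url"), "\t"), "/logs/grouped");

            // Logical pipeline: group records by user_id.
            Pipe pipe = new Pipe("group-by-user");
            pipe = new GroupBy(pipe, new Fields("user_id"));

            FlowDef flow = FlowDef.flowDef()
                    .addSource(pipe, in)
                    .addTailSink(pipe, out);

            // The planner checks the graph and runs it as MapReduce jobs.
            new HadoopFlowConnector().connect(flow).complete();
        }
    }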
1.1.2.5. Flume
Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume maintains a central list of ongoing data flows, stored redundantly in Zookeeper.

One very common use of Hadoop is taking web server or other logs from a large number of machines and periodically processing them to pull out analytics information. The Flume project is designed to make the data-gathering process easy and scalable, by running agents on the source machines that pass the data updates to collectors, which then aggregate them into large chunks that can be efficiently written as HDFS files. It is usually set up using a command-line tool that supports common operations, like tailing a file or listening on a network socket, and has tunable reliability guarantees that let you trade off performance against the potential for data loss.
More features:
. Reliability (the ability to continue delivering events in the face of failures without losing data). Flume can guarantee that all data received by an agent node will eventually make it to the collector at the end of its flow, as long as the agent node keeps running. That is, data can be reliably delivered to its eventual destination. Flume allows the user to specify, on a per-flow basis, the level of reliability required. There are three supported reliability levels: end-to-end, store on failure, and best effort.
. Scalability (the ability to increase system performance linearly by adding more resources to the system). A key performance measure in Flume is the number or size of events entering the system and being delivered. When load increases, it is simple to add more resources to the system, in the form of more machines, to handle the increased load.
. Manageability (the ability to control data flows, monitor nodes, modify settings, and control outputs of a large system). The Flume Master is the point where global state, such as the data flows, can be managed. Via the Flume Master, users can monitor flows and reconfigure them on the fly.
. Extensibility (the ability to add new functionality to a system). For example, you can extend Flume by adding connectors to existing storage layers or data platforms. This is made possible by simple interfaces, separation of functional concerns into simple pieces, a flow specification language, and a simple but flexible data model. Flume provides many common input and output connectors.
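Note that the Master/collector model described above belongs to the original Flume (OG); the newer Flume NG (1.x) releases instead declare a flow per agent in a properties file. A hedged sketch in the NG style (the agent name, log path, and namenode host are illustrative), tailing a web server log into HDFS:

    agent.sources = tail-src
    agent.channels = mem-ch
    agent.sinks = hdfs-sink

    # Tail a web server log as the event source.
    agent.sources.tail-src.type = exec
    agent.sources.tail-src.command = tail -F /var/log/httpd/access_log
    agent.sources.tail-src.channels = mem-ch

    # Buffer events in memory (fast, but lost if the agent crashes).
    agent.channels.mem-ch.type = memory

    # Aggregate events into files on HDFS.
    agent.sinks.hdfs-sink.type = hdfs
    agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/weblogs
    agent.sinks.hdfs-sink.channel = mem-ch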
1.1.2.6. Chukwa
Log processing was one of the original purposes of MapReduce. Unfortunately, Hadoop is hard to use for this purpose: writing MapReduce jobs to process logs is somewhat tedious, and the batch nature of MapReduce makes it difficult to use with logs that are generated incrementally across many machines. Furthermore, HDFS still does not support appending to existing files.

Chukwa is a Hadoop subproject that bridges the gap between log handling and MapReduce. It provides a scalable distributed system for monitoring and analyzing log-based data. Some of its durability features include agent-side replaying of data to recover from errors.
. Collection components of Chukwa: adaptors, agents (which run on each machine and emit data), and collectors (which receive data from the agents and write it to stable storage).
. Chukwa includes the Hadoop Infrastructure Care Center (HICC), a web interface for visualizing system performance.
. Flexible and powerful toolkit for displaying, monitoring and analyzing results, to make the best use of the collected data.
. Chukwa's reliability model supports two levels: end-to-end reliability, and fast-path delivery, which minimizes latency. After writing data into HDFS, Chukwa runs a MapReduce job to demultiplex the data into separate streams.

1.1.2.7. Sqoop
It is an open-source tool that allows users to extract data from a relational database into Hadoop for further processing. This processing can be done with MapReduce programs or with other higher-level tools such as Hive. (It is even possible to use Sqoop to move data from a relational database into HBase.) When the final results of an analytic pipeline are available, Sqoop can export these results back to the database for consumption by other clients.
More features:
. Bulk import. Sqoop can import individual tables or entire databases into HDFS. The data is stored in native directories and files in the HDFS file system.
. Direct input. Sqoop can import and map SQL (relational) databases directly into Hive and HBase.
. Data interaction. Sqoop can generate Java classes so that you can interact with the data programmatically.
. Data export. Sqoop can export data directly from HDFS into a relational database, using a target table definition based on the specifics of the target database.
. It integrates with Oozie.
. It is a command-line interpreter.
. It comes complete with connectors to MySQL, PostgreSQL, Oracle, SQL Server and DB2.
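A sketch of the import/export round trip described above on the Sqoop command line (the connection string, tables, and paths are illustrative):

    # Bulk-import a relational table into HDFS.
    sqoop import --connect jdbc:mysql://dbhost/sales \
                 --table orders --target-dir /data/orders

    # Export the results of an analytic pipeline back to the database.
    sqoop export --connect jdbc:mysql://dbhost/sales \
                 --table order_stats --export-dir /results/order_stats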
1.1.2.8. Oozie
Oozie is a workflow scheduler system to manage Hadoop jobs. An Oozie workflow is a collection of actions (i.e. Hadoop Map/Reduce jobs, Pig jobs) arranged in a control-dependency DAG (Directed Acyclic Graph) specifying a sequence of actions to execute. This graph is specified in hPDL (an XML Process Definition Language).
More features:
. Oozie is a scalable, reliable and extensible system.
. Oozie can detect completion of computation/processing tasks by two different means: callbacks and polling.
. Some workflows are invoked on demand, but the majority of the time it is necessary to run them based on regular time intervals and/or data availability and/or external events. The Oozie Coordinator system allows the user to define workflow execution schedules based on these parameters.
. It can run jobs sequentially (one after the other) and in parallel (multiple at a time).
. Oozie can also run plain Java classes and Pig workflows, and interact with HDFS.
. Oozie provides major flexibility (start, stop, suspend and re-run jobs). It allows you to restart from a failure (you can tell Oozie to restart a job from a specific node in the graph, or to skip specific failed nodes).
. Java client API / command-line interface (launch, control and monitor jobs from your Java apps).
. Web service API (you can control jobs from anywhere).
. Receive an email when a job is complete.
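A minimal hPDL sketch of a one-action workflow (the workflow name, paths, and properties are illustrative, and the mapper/reducer class configuration is elided); each action node declares its ok/error transitions, which together form the control-dependency DAG:

    <workflow-app name="log-pipeline" xmlns="uri:oozie:workflow:0.4">
        <start to="count-users"/>
        <action name="count-users">
            <map-reduce>
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <configuration>
                    <!-- mapper/reducer classes would be configured as further properties -->
                    <property>
                        <name>mapred.input.dir</name>
                        <value>/logs/2015-05-21</value>
                    </property>
                    <property>
                        <name>mapred.output.dir</name>
                        <value>/output/user_hits</value>
                    </property>
                </configuration>
            </map-reduce>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Map/Reduce action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>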
1.1.2.9. HCatalog
HCatalog is an abstraction for data storage and a metadata service. It provides a set of interfaces that open up access to Hive's metastore for tools inside and outside of the Hadoop grid.
More features:
. It presents users with a table abstraction. This frees them from knowing where or how their data is stored.
. It allows data producers to change how they write data while still supporting existing data in the old format, so that data consumers do not have to change their processes.
. It provides a shared schema and data model for Pig, Hive, and MapReduce.
. It provides interoperability across data processing tools such as Pig, MapReduce, and Hive.
. A REST interface allows language-independent access to Hive's metadata.
. HCatalog includes Hive's command-line interface, so that administrators can create and drop tables, specify table parameters, etc.
. It also provides an API for storage-format developers to tell HCatalog how to read and write data stored in different formats.
. It supports RCFile (Record Columnar File), CSV (Comma Separated Values), JSON (JavaScript Object Notation), and SequenceFile formats.
. The data model of HCatalog is similar to HBase's data model.
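As a sketch of the shared table abstraction, a Pig script can read a Hive-defined table by name through HCatalog rather than knowing its path or storage format (the table and field names are illustrative; the loader class shown is the one shipped in recent Hive/HCatalog releases):

    logs   = LOAD 'weblogs' USING org.apache.hive.hcatalog.pig.HCatLoader();
    recent = FILTER logs BY user_id IS NOT NULL;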
Hive
Advantages:
. Open source
. Easy data summarization
. Ad-hoc queries
. Provides the Hive Query Language, similar to SQL
. Metadata store, which makes lookup easy
Disadvantages:
. It is not for OLAP processing
. Data is required to be loaded from a file

Impala
Advantages:
. Open source
. SQL operation on top of Hadoop
. Useful with HBase, Hive, Pig
. Query results faster than Hive
Disadvantages:
. Not all Hive SQL is supported
. You cannot create or modify a table

Pig
Advantages:
. Open source
. Very quick for processing large stable datasets, such as meteorological trends or web-server logs
. Perfect for data processing that involves a number of steps (a pipeline of processing)
. Ideal for solving problems that can be carved up, analyzed in pieces in parallel, and then put back together (text mining, sentiment trends, recommendation, pattern recognition)
. Pig makes it simple to build scripts to analyze data, experimenting with approaches to identify the best one
. It resides on the user's machine; it is not necessary to install anything in the Hadoop cluster
Disadvantages:
. It is not ideal for real-time or near real-time processing

Cascading
Advantages:
. Open source
. There are a lot of pre-built components that can be composed together
. Very custom operations can be written as a straight Java function
. It allows you to write analytics jobs quickly and easily in a familiar language
Disadvantages:
. It is not the best fit for some fine-grained, performance-critical problems

Flume
Advantages:
. Open source
. Scalable
. A solution for data collection of all forms
. Possible sources for Flume include Avro files and system logs
. It has a query processing engine
. It allows streaming data to be managed and captured into Hadoop
Disadvantages:
. It does not do real-time analytics

Chukwa
Advantages:
. Open source
. Scalable
. Comprehensive toolset for log analysis
. It has a rich metadata model
. It can collect a variety of system metrics and can receive data via a variety of network protocols, including syslog
. It provides a framework for processing the collected data
Disadvantages:
. Chukwa works with an agent-collector set-up that works predominantly with a single collector, unless configured for a multi-collector set-up
. It does not have any support for a gzip feature to zip the data files before or after storing data in HDFS

Sqoop
Advantages:
. Open source
. It is extensible; a number of third-party companies ship database-specific connectors
. Connectors register metadata (Sqoop 2)
. Admins set policy for connection use (Sqoop 2)
. It is compatible with almost any JDBC-enabled database
. Integration with Hive and HBase
Disadvantages:
. Although Sqoop supports importing to a Hive table/partition, it does not allow exporting from a table or partition

Oozie
Advantages:
. It supports: MapReduce (Java, streaming, pipes), Pig, Java, filesystem, SSH, sub-workflow
. It supports variables and functions
. Interval job scheduling is time- and input-data-dependent
Disadvantages:
. All job management happens on the command line, and the default UI is read-only and requires a non-Apache-licensed JavaScript library, which makes it more difficult to use

HCatalog
Advantages:
. It provides a shared schema and data model for Pig, Hive, and MapReduce
Disadvantages:
. None found

Table 2: Map – Reduce

1.1.3. Machine learning
Machine learning is a branch of artificial intelligence concerned with the construction and study of systems that can learn from data. For example, a machine learning system could be trained on email messages to learn to distinguish between spam and non-spam messages. After learning, it can then be used to classify new email messages into spam and non-spam folders.

The core of machine learning deals with representation and generalization. Generalization is the ability of a learning machine to perform accurately on new, unseen examples or tasks after having experienced a learning data set. The training examples come from some generally unknown probability distribution (considered representative of the space of occurrences), and the learner has to build a general model of this space that enables it to produce sufficiently accurate predictions in previously unseen cases. Machine learning focuses on prediction, based on known properties learned from the training data.
1.1.3.1. WEKA
WEKA is a Java-based framework and GUI for machine learning algorithms. It provides a plug-in architecture for researchers to add their own techniques, with a command-line and windowed interface that makes it easy to apply them to your own data. You can use it to do everything from basic clustering to advanced classification, together with a lot of tools for visualizing your results.

It is heavily used as a teaching tool, but it also comes in extremely handy for prototyping and experimenting outside of the classroom. It has a strong set of preprocessing tools that make it easy to load your data in, and then you have a large library of algorithms at your fingertips, so you can quickly try out ideas until you find an approach that works for your problem. The command-line interface allows you to apply exactly the same code in an automated way for production.
More features:
. WEKA includes data preprocessing tools.
. Classification/regression algorithms.
. Clustering algorithms.
. Attribute/subset evaluators and search algorithms for feature selection.
. Algorithms for finding association rules.
. Graphical user interfaces: the Explorer (exploratory data analysis), the Experimenter (experimental environment), and the Knowledge Flow (a new process-model-inspired interface).
. WEKA is platform-independent.
. It is easily usable by people who are not data mining specialists.
. Provides flexible facilities for scripting experiments.
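A minimal sketch of WEKA's Java API for the spam/non-spam style of classification described earlier (the ARFF file name is illustrative): load a dataset, train a decision tree, and classify an instance.

    import weka.classifiers.trees.J48;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaExample {
        public static void main(String[] args) throws Exception {
            // Load training data; the last attribute is the class (e.g. spam / not-spam).
            Instances data = DataSource.read("mail.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Train a J48 decision tree (WEKA's C4.5 implementation).
            J48 tree = new J48();
            tree.buildClassifier(data);

            // Classify the first instance as a smoke test.
            Instance first = data.instance(0);
            double label = tree.classifyInstance(first);
            System.out.println(data.classAttribute().value((int) label));
        }
    }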
1.1.3.2. Mahout
Mahout is an open-source machine learning library from Apache. It primarily covers recommender engines (collaborative filtering), clustering, and classification. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine. It is a framework of tools intended to be used and adapted by developers. In practical terms, the framework makes it easy to use analysis techniques to implement features such as Amazon's "People who bought this also bought" recommendation engine on your own site.
More features:
. Mahout is scalable.
. It supports algorithms for recommendation. For example, it takes users' behavior and from that tries to find items users might like.
. Algorithms for clustering. It takes e.g. text documents and groups them into groups of topically related documents.
. Algorithms for classification. It learns from existing categorized documents what documents of a specific category look like, and is able to assign unlabeled documents to the correct category.
. Algorithms for frequent itemset mining. It takes a set of item groups (terms in a query session, shopping-cart content) and identifies which individual items usually appear together.
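A minimal sketch of a user-based recommender using Mahout's Taste API (the file name and IDs are illustrative); this is the single-machine flavor of the "people who bought this also bought" feature described above:

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;

    public class MahoutExample {
        public static void main(String[] args) throws Exception {
            // CSV of userID,itemID,preference triples.
            DataModel model = new FileDataModel(new File("ratings.csv"));

            PearsonCorrelationSimilarity similarity = new PearsonCorrelationSimilarity(model);
            NearestNUserNeighborhood neighborhood =
                    new NearestNUserNeighborhood(10, similarity, model);
            GenericUserBasedRecommender recommender =
                    new GenericUserBasedRecommender(model, neighborhood, similarity);

            // Top 3 items for user 1, based on the behavior of similar users.
            List<RecommendedItem> items = recommender.recommend(1L, 3);
            for (RecommendedItem item : items)
                System.out.println(item.getItemID() + " " + item.getValue());
        }
    }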
WEKA
Advantages:
. Free availability under the GNU General Public License
. Portability, since it is fully implemented in Java
. Ease of use due to its graphical user interfaces
. It provides access to SQL databases using Java Database Connectivity and can process the result returned by a database query
Disadvantages:
. It is not capable of multi-relational data mining
. Sequence modeling is not covered
. Experiments involving very large quantities of data (millions of instances) can take a long time to process

Mahout
Advantages:
. Open source
. Scalable
. It can process very large data quantities
. It has functionality for many of today's common machine learning tasks
Disadvantages:
. Mahout is merely a library of algorithms; it is not a product

Table 3: Machine learning

1.1.4. Visualization
Visualization tools let you gain deeper insights from the data stored in Hadoop. Including these tools in an analysis reveals patterns and associations that would otherwise be missed.

1.1.4.1. Fusion Tables
Google has created an integrated online system that lets you store large amounts of data in spreadsheet-like tables and gives you tools to process and visualize the information. It is particularly good at turning geographic data into compelling maps, with the ability to upload your own custom KML (an XML notation for expressing geographic annotation and visualization within Internet-based, two-dimensional maps and three-dimensional Earth browsers) outlines for areas like political constituencies. There is also a full set of traditional graphing tools, as well as a wide variety of options to perform calculations on your data.

Fusion Tables is a powerful system, but it is definitely aimed at fairly technical users; the sheer variety of controls can be intimidating at first. If you are looking for a flexible tool to make sense of large amounts of data, it is worth making the effort.
More features:
. Fusion Tables is an experimental data visualization web application to gather, visualize, and share larger data tables.
. Fusion Tables lets you visualize bigger table data online: filter and summarize across hundreds of thousands of rows, then try a chart, map, network graph, or custom layout and embed or share it. Merge two or three tables to generate a single visualization.
. Combine with other data on the web.
. Make a map in minutes.
. Host data online.

1.1.4.2. Tableau
Originally a traditional desktop application for drawing graphs and visualizations, Tableau has been adding a lot of support for online publishing and content creation. Its embedded graphs have become very popular with news organizations on the Web, illustrating a lot of stories. The support for geographic data is not as extensive as Fusion's, but Tableau is capable of creating some map styles that Google's product cannot produce.
More features:
. With Tableau Public, interactive visuals can be created and published without the help of programmers.
. It offers hundreds of visualization types, such as maps, bar and line charts, lists, and heat maps.
. Tableau Public is automatically touch-optimized for Android and iPad tablets. It supports all browsers without plug-ins.

Fusion Tables
Advantages:
. Good at turning geographic data into compelling maps, with the ability to upload your own custom KML
. It offers spatial query processing and very thorough Google Maps integration
Disadvantages:
. Access must be authenticated
. There is no organization of datasets

Tableau
Advantages:
. It is fast at bringing in data, thanks to its in-memory analytical engine
. It has native connectors to Cloudera Impala and Cloudera Hadoop, DataStax Enterprise, Hortonworks and the MapR Hadoop distribution for Hadoop reporting and analysis
. It has powerful visualization capabilities that let you create maps, charts and dashboards easily
Disadvantages:
. It is not open source

Table 4: Visualization
1.1.5. Search
Search is well suited to leveraging many different types of information, especially unstructured information. One of the first things any organization will want to do once it accumulates a mass of Big Data is search it.

1.1.5.1. Lucene
Lucene is a Java-based search library. It has an architecture that employs best-practice relevancy ranking and querying, as well as state-of-the-art text compression and a partitioned index strategy, to optimize both query performance and indexing flexibility.
More features:
. Speed: sub-second query performance for most queries.
. Complete query capabilities: keyword, Boolean and +/- queries, proximity operators, wildcards, fielded searching, term/field/document weights, find-similar, spell-checking, multilingual search and others.
. Full results processing, including sorting by relevancy, date or any field, dynamic summaries and hit highlighting.
. Portability: it runs on any platform supporting Java, and indexes are portable across platforms. You can build an index on Linux, copy it to a Microsoft Windows machine and search it there.
. Scalability: there are production applications in the hundreds of millions and billions of documents/records.
. Low-overhead indexes and rapid incremental indexing.
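A minimal indexing-and-search sketch against the Lucene 5.x API (the field name, sample text, and index path are illustrative): index one document with a full-text field, then run a parsed keyword query ranked by relevancy.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneExample {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory dir = FSDirectory.open(Paths.get("index"));

            // Index one document with a full-text "body" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body", "Hadoop ecosystem overview", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search with a parsed keyword query; hits come back ranked by relevancy.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                ScoreDoc[] hits = searcher.search(
                        new QueryParser("body", analyzer).parse("hadoop"), 10).scoreDocs;
                for (ScoreDoc hit : hits)
                    System.out.println(searcher.doc(hit.doc).get("body"));
            }
        }
    }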
1.1.5.2. Solr
Solr is a standalone enterprise search server with a REST-like API. You put documents in it (called "indexing") via XML, JSON, CSV or binary over HTTP. You query it via HTTP GET and receive XML, JSON, CSV or binary results. Solr is highly scalable, providing distributed search and index replication.
More features:
. Advanced full-text search capabilities.
. Optimized for high-volume web traffic.
. Standards-based open interfaces: XML, JSON and HTTP.
. Comprehensive HTML administration interfaces.
. Server statistics exposed over JMX for monitoring.
. Linearly scalable, with automatic index replication, automatic failover and recovery.
. Near real-time indexing.
. Flexible and adaptable, with XML configuration.
. Extensible plugin architecture.

Lucene
Advantages:
. It is the core search library (a library for indexing and searching text)
Disadvantages:
. ACID (or near-ACID) behavior is not guaranteed; a crash while writing to a Lucene index might render it useless

Solr
Advantages:
. It is the logical starting point for developers building search applications
. It is good at reads
Disadvantages:
. Documents are updated as a whole rather than per field (so when you have a million documents that say "German" and should say "French", you have to reindex a million documents)
. It takes too long to update and commit

Table 5: Search