2. Apache Hadoop
• Apache Hadoop is an open-source software
framework that supports data-intensive distributed
applications, licensed under the Apache v2 license.
• Hadoop was created by Doug Cutting and Mike
Cafarella in 2005.
• Cutting named it after his son's toy elephant. Hadoop was
originally developed to support distribution for the
Nutch search engine project.
3. Hadoop is an open-source project, initiated by the Apache
Software Foundation, that enables processing of large data sets in a
distributed manner. The core of Hadoop consists of
two main components:
• MapReduce
• HDFS (Hadoop Distributed File System)
4. • MapReduce is a framework, or programming model, that allows tasks to be carried
out in parallel across a large cluster of computers. It mainly consists
of two functions, namely Map and Reduce, as the sketch below illustrates.
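As a minimal sketch of the model (the class names and Java API usage below are illustrative, not taken from the slides), the classic word-count program shows the two functions: Map emits a (word, 1) pair for every word in a line, and Reduce sums the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: emit (word, 1) for every word in an input line.
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: sum the counts emitted for each word.
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                              Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }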
7. Map-Reduce Architecture
[Diagram: the Client submits jobs to the JobTracker (master), which assigns tasks to TaskTrackers (slaves); each TaskTracker runs Tasks.]
• Client: provides the UI for submitting jobs and polls status information.
• JobTracker (master): accepts MR jobs, assigns tasks to slaves, monitors tasks, and handles failures.
• TaskTrackers (slaves): run Map and Reduce tasks and manage intermediate output.
• Task: runs the Map and Reduce functions and reports progress.
8. 1. The client submits the job to the JobTracker running on the master node of the Hadoop cluster.
2. The JobTracker generates and returns a job id for the submitted MapReduce job to the
client. This id is used by the client or the NameNode to stop or kill the job if needed.
3. The job resources, such as the required JAR files, metadata files, and input files for the MapReduce
tasks, are copied from the client to HDFS, where they can be accessed by the NameNode as well
as the DataNodes for processing.
4. The JobTracker then schedules the job to the TaskTrackers running on different DataNodes.
5. Each TaskTracker runs the Map or Reduce tasks assigned to it by the JobTracker.
Once a task finishes, the result is returned to the JobTracker. The TaskTracker also keeps sending
heartbeat messages to the JobTracker, indicating that its DataNode is up and running.
6. The JobTracker collects the final results from all the DataNodes and returns them to the client in the
prescribed format.
A minimal job-submission sketch in Java follows this list.
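The steps above describe the classic (MRv1) JobTracker/TaskTracker flow. As a hedged illustration, the driver below submits the word-count job from the earlier sketch through the org.apache.hadoop.mapreduce Job API, which performs steps 1-3 (job-id generation and resource upload) on the client's behalf; the paths and class names are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");

            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);   // mapper from the earlier sketch
            job.setReducerClass(WordCountReducer.class); // reducer from the earlier sketch
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Input and output locations in HDFS, taken from the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Ships the JAR and configuration to the cluster, then blocks
            // until the framework reports completion.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }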
10. Support of Hadoop for Big Data
• HBase: a distributed, column-oriented database that runs on
top of HDFS. The data model of HBase allows data to scale
beyond traditional relational database systems by
grouping the columns of data into column families, as the sketch below illustrates.
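A minimal sketch of the HBase data model through the Java client API, assuming a "users" table with a "profile" column family already exists: every cell is addressed by row key, column family, and column qualifier.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 // The "users" table with a "profile" column family is assumed.
                 Table table = conn.getTable(TableName.valueOf("users"))) {

                // Write one cell: row "u1", family "profile", qualifier "name".
                Put put = new Put(Bytes.toBytes("u1"));
                put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                              Bytes.toBytes("Alice"));
                table.put(put);

                // Read the cell back by row key.
                Result result = table.get(new Get(Bytes.toBytes("u1")));
                byte[] name = result.getValue(Bytes.toBytes("profile"),
                                              Bytes.toBytes("name"));
                System.out.println(Bytes.toString(name));
            }
        }
    }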
• Hive: a data warehouse that allows querying of large
datasets stored in HDFS through an SQL-like language interface
called HiveQL. Hive is used for ad-hoc queries and data-summarization
analysis of large data sets stored in HDFS; a HiveQL sketch follows below.
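A minimal HiveQL sketch run through the Hive JDBC driver; the HiveServer2 URL, credentials, and the "page_views" table are assumptions for illustration.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class HiveExample {
        public static void main(String[] args) throws Exception {
            // Older driver versions need explicit registration.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 host, port, and credentials are assumptions.
            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "hadoop", "");
                 Statement stmt = conn.createStatement();
                 // HiveQL: an aggregate over a hypothetical "page_views" table.
                 ResultSet rs = stmt.executeQuery(
                     "SELECT country, COUNT(*) AS views " +
                     "FROM page_views GROUP BY country")) {
                while (rs.next()) {
                    System.out.println(rs.getString("country")
                            + "\t" + rs.getLong("views"));
                }
            }
        }
    }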
11. • Pig: Apache Pig is a Hadoop platform that helps in
analyzing large data sets stored in the Hadoop file system. Pig
Latin, a high-level procedural language, facilitates analyzing
the data sets. It enables Hadoop users to query the data
sets without MapReduce knowledge by allowing simple
queries similar to SQL, as sketched below.
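A minimal Pig Latin sketch driven from Java through PigServer; the input file and field names are assumptions, and the same script could equally be typed into Pig's Grunt shell.

    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class PigExample {
        public static void main(String[] args) throws Exception {
            // Local mode for illustration; ExecType.MAPREDUCE targets the cluster.
            PigServer pig = new PigServer(ExecType.LOCAL);

            // Pig Latin: load a hypothetical tab-separated log file, group by
            // user, and count records per user -- no hand-written MapReduce.
            pig.registerQuery(
                "logs = LOAD 'access_log.tsv' AS (user:chararray, url:chararray);");
            pig.registerQuery("by_user = GROUP logs BY user;");
            pig.registerQuery(
                "counts = FOREACH by_user GENERATE group, COUNT(logs);");

            // Pig compiles the script into MapReduce jobs and writes the result.
            pig.store("counts", "user_counts");
        }
    }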
• Sqoop: a command-line interface tool that allows transfer
of data between structured (relational) databases and Hadoop
platforms, whether HDFS, Hive, or HBase. It
also allows exporting data back to the relational databases. An import sketch follows below.
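Sqoop is normally run from the shell (e.g. "sqoop import ..."). As an assumption-laden sketch, Sqoop 1.x can also be invoked programmatically via Sqoop.runTool; the JDBC URL, credentials, table, and target directory below are placeholders.

    import org.apache.sqoop.Sqoop;

    public class SqoopImportExample {
        public static void main(String[] args) {
            // Equivalent to "sqoop import ..." on the command line;
            // all connection details here are placeholders.
            String[] importArgs = {
                "import",
                "--connect", "jdbc:mysql://dbhost/sales",
                "--username", "etl",
                "--table", "customers",
                "--target-dir", "/user/hadoop/customers"
            };
            int exitCode = Sqoop.runTool(importArgs);
            System.exit(exitCode);
        }
    }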
12. • Mahout: a library whose primary goal is to
build scalable machine learning algorithms.
• Ambari: a web user interface that helps in
monitoring, provisioning, and managing Hadoop
clusters through RESTful APIs (see the sketch below). The
Hadoop components supported by Ambari include HDFS,
MapReduce, Hive, Sqoop, Pig, Oozie, ZooKeeper, and
HCatalog.
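A small sketch of Ambari's RESTful API: listing the managed clusters with an HTTP GET against the /api/v1/clusters endpoint. The host, port, and credentials are assumptions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class AmbariClustersExample {
        public static void main(String[] args) throws Exception {
            // Ambari server host, port, and credentials are assumptions.
            URL url = new URL("http://ambari-host:8080/api/v1/clusters");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder().encodeToString(
                    "admin:admin".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);

            // The response is a JSON listing of the clusters Ambari manages.
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }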
13. • Chukwa: an open-source data collection system that helps
in monitoring large distributed systems. It is built on top of
HDFS with MapReduce support and thus inherits their
robustness and scalability.
• Avro: a framework that helps in performing data
serialization and remote procedure calls. It is most favorable
for scripting languages and platforms such as Pig, as it facilitates
transferring data from one program or language to another
(such as from C to Pig). A small serialization sketch follows below.
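A minimal Avro serialization sketch in Java: a record is defined by a JSON schema and encoded to Avro's compact binary form, which any language binding for the same schema can decode. The schema and values are illustrative.

    import java.io.ByteArrayOutputStream;

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            // A language-neutral schema; any Avro binding (C, Java, Pig, ...)
            // can read records written against it.
            String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                    + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                    + "{\"name\":\"age\",\"type\":\"int\"}]}";
            Schema schema = new Schema.Parser().parse(schemaJson);

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Alice");
            user.put("age", 30);

            // Serialize the record to compact Avro binary.
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            GenericDatumWriter<GenericRecord> writer =
                    new GenericDatumWriter<>(schema);
            BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
            writer.write(user, encoder);
            encoder.flush();

            System.out.println("Serialized " + out.size() + " bytes");
        }
    }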
14. • Cassandra: a multi-master database with high
availability, scalability, and performance. It can serve both as a
real-time operational datastore and as a read-intensive
database for business intelligence applications. It supports
replication across multiple data centres and is well suited as a
platform for mission-critical data.
• ZooKeeper: a centralized service that maintains the
configuration details of a distributed system, including
the naming, distribution, and synchronization of the services;
a small client sketch follows below.
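A minimal ZooKeeper client sketch: a configuration value is published as a znode and read back, illustrating the shared naming and configuration service described above. The ensemble address and znode path are assumptions.

    import java.util.concurrent.CountDownLatch;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class ZooKeeperExample {
        public static void main(String[] args) throws Exception {
            // Wait until the session to the (assumed) ensemble is established.
            CountDownLatch connected = new CountDownLatch(1);
            ZooKeeper zk = new ZooKeeper("zkhost:2181", 3000, event -> {
                if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            });
            connected.await();

            // Publish a shared configuration value as a znode.
            zk.create("/app-config", "max_workers=8".getBytes(),
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            // Any client in the cluster can now read the same value.
            byte[] data = zk.getData("/app-config", false, null);
            System.out.println(new String(data));

            zk.close();
        }
    }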
15. • Oozie: a scheduler that helps in managing Hadoop jobs.
An application may require multiple MapReduce jobs to run;
Oozie manages the workflows between these jobs by
managing workflow instances, their variables, and the control
dependencies among them (see the sketch below).
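A minimal sketch using the Oozie Java client; the server URL, HDFS paths, and properties are assumptions, and the referenced workflow.xml (which defines the chain of MapReduce actions and their control dependencies) is presumed to be deployed already.

    import java.util.Properties;

    import org.apache.oozie.client.OozieClient;

    public class OozieSubmitExample {
        public static void main(String[] args) throws Exception {
            // Oozie server URL and HDFS paths are assumptions.
            OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

            Properties conf = oozie.createConfiguration();
            // workflow.xml at this path defines the chained MapReduce actions.
            conf.setProperty(OozieClient.APP_PATH,
                             "hdfs://namenode:8020/user/hadoop/app");
            conf.setProperty("nameNode", "hdfs://namenode:8020");
            conf.setProperty("jobTracker", "jobtracker:8021");

            // Submit and start the workflow; Oozie tracks each instance.
            String jobId = oozie.run(conf);
            System.out.println("Workflow job id: " + jobId);
        }
    }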
• Flume: a distributed service that helps in collecting and
aggregating large amounts of log data. It seems similar to
Chukwa, but the difference is that Flume is used for near-real-time
analytics while Chukwa is used for batch-oriented or periodic
analytics.
16. • BigSQL: an SQL interface developed by IBM for its Hadoop
platform, InfoSphere BigInsights. It does not turn Hadoop
into a relational database, but rather lets developers
with SQL knowledge create tables for the data stored in
Hive, HBase, and the distributed file system.
17. • Stinger: an initiative from Hortonworks and
Microsoft to improve the SQL interface of Hive and to
make Hive queries execute much faster.
• Apache Drill: aims at providing real-time query
execution on data stored in Hadoop. The goal of
the project is to return query results over petabytes of
data (trillions of records) in less than a second.