3. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
6. What is Hadoop?
Definition + Brief History of Hadoop
7. What Is Apache Hadoop?
▪ The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.
▪ It is a top-level Apache project: an open-source implementation of frameworks for reliable, scalable, distributed computing and data storage.
▪ It provides a flexible and highly available architecture for large-scale computation and data processing on a network of commodity hardware.
8. History
▪ Hadoop was created by Doug Cutting, the creator of Apache
Lucene, the widely used text search library.
▪ Hadoop has its origins in Apache Nutch, an open-source web search engine, itself a part of the Lucene project.
10. Evolution (cont.)
▪ Apache Hadoop was inspired by Google’s MapReduce and
Google File System papers and cultivated at Yahoo!
▪ It started as a large-scale distributed batch processing
infrastructure.
▪ Designed to meet the need for an affordable, scalable, and flexible infrastructure for working with very large data sets.
14. Hadoop’s Developers
▪ 2005: Doug Cutting and Michael J. Cafarella developed Hadoop to support distribution for the Nutch search engine project.
▪ The project was funded by Yahoo!.
▪ 2006: Yahoo! gave the project to the Apache Software Foundation.
[Photo: Doug Cutting]
15. The Origin of the Name “Hadoop”
▪ The name Hadoop is not an acronym; it’s a made-up name. The
project’s creator, Doug Cutting, explains how the name came
about:
▪ The name my kid gave a stuffed yellow elephant.
▪ Short, relatively easy to spell and pronounce, meaningless, and
not used elsewhere: those are my naming criteria.
▪ Kids are good at generating such. Googol is a kid’s term.
▪ Subprojects and “contrib” modules in Hadoop also tend to have names that are unrelated to their function, often with an elephant or other animal theme (“Pig,” for example).
16. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
17. The Rise of Big Data
18. Big Data
▪ We live in the age of big data.
▪ The data volumes we need to work with on a day-to-day basis have outgrown
the storage and processing capabilities of a single host.
▪ Big data brings with it two fundamental challenges:
▪ how to store and work with voluminous data sizes
▪ how to understand data and turn it into a competitive advantage
19. Examples:
▪ This flood of data is coming from many sources:
▪ The New York Stock Exchange generates about one terabyte of new trade data per
day.
▪ Facebook hosts approximately 10 billion photos, taking up one petabyte of storage.
▪ Ancestry.com, the genealogy site, stores around 2.5 petabytes of data.
▪ The Internet Archive stores around 2 petabytes of data, and is growing at a rate of 20
terabytes per month.
▪ The Large Hadron Collider near Geneva, Switzerland, will produce about 15
petabytes of data per year.
▪ So there’s a lot of data out there…
20. Evolution…
▪ Before Hadoop, big data required a lot of computing power, storage, and
parallelism.
▪ This meant that organizations had to spend a lot of money to build the infrastructure needed to support big data analytics.
▪ Given the large price tag, only the largest Fortune 500 organizations could
afford such an infrastructure.
21. At Last, Hadoop!
▪ It is very difficult to store, compute on, and analyze these massive volumes of data.
▪ Hadoop is designed to answer the question:
▪ “How can we process big data at a reasonable cost and in a reasonable time?”
22. Agenda
▪ Introduction
▪ What is Hadoop?
▪ Why do we need Hadoop?
▪ Architecture/Anatomy of Hadoop
▪ What can we do with Hadoop?
▪ Hands On Exercise
25. Architecture of Hadoop Cluster
26. Anatomy of Hadoop – Basic Modules
▪ Hadoop Common: The common utilities that support the other Hadoop
modules.
▪ Hadoop Distributed File System (HDFS™): A distributed file system that
provides high-throughput access to application data.
▪ Hadoop YARN: A framework for job scheduling and cluster resource
management.
▪ Hadoop MapReduce: A YARN-based system for parallel processing of large
data sets.
27. Hadoop Distributed File System (HDFS)
▪ When a dataset outgrows the storage capacity of a single physical machine, it
becomes necessary to partition it across a number of separate machines.
▪ Filesystems that manage the storage across a network of machines are called
distributed filesystems.
▪ Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed File System.
▪ HDFS is designed to run on commodity hardware: it is highly fault-tolerant and is designed to be deployed on low-cost hardware (a minimal usage sketch follows below).
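To make the filesystem concrete, here is a minimal sketch (not from the original slides) of writing a file to HDFS from Java using the FileSystem API. The path and class names are hypothetical; the cluster address is assumed to come from the standard core-site.xml configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);          // handle to the configured filesystem
    Path file = new Path("/user/demo/hello.txt");  // hypothetical path
    try (FSDataOutputStream out = fs.create(file, true)) { // overwrite if present
      out.writeUTF("Hello, HDFS");                 // HDFS splits data into blocks and replicates them
    }
    System.out.println("exists: " + fs.exists(file));
  }
}
```

From the application’s point of view this looks like ordinary file I/O; the block splitting and replication described on the following slides happen transparently underneath.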
28. What HDFS Does
▪ HDFS provides high throughput access to application data and is suitable for
applications that have large data sets.
▪ Provides scalable and reliable data storage, and it was designed to span large
clusters of commodity servers.
▪ HDFS has demonstrated production scalability of up to 200 PB of storage and a
single cluster of 4500 servers, supporting close to a billion files and blocks.
▪ HDFS is designed to store very large amounts of information (terabytes or petabytes), spreading the data across a large number of machines.
29. What HDFS Does (cont.)
▪ HDFS should store data reliably. If individual machines in the cluster
malfunction, data should still be available.
▪ HDFS should provide fast, scalable access to this information.
▪ It should be possible to serve a larger number of clients by simply adding more
machines to the cluster.
▪ HDFS should integrate well with Hadoop MapReduce, allowing data to be read
and computed upon locally when possible.
30. How HDFS Works
▪ An HDFS cluster consists of a NameNode, which manages the cluster metadata, and DataNodes, which store the data.
▪ Files and directories are represented on the NameNode by inodes.
▪ Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas (see the sketch below).
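As a hedged illustration of the per-file metadata the NameNode keeps, the FileStatus object returned by the Java FileSystem API exposes these inode-like attributes (the path below is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsStat {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/demo/hello.txt")); // hypothetical path
    System.out.println("permissions : " + st.getPermission());      // e.g. rw-r--r--
    System.out.println("modified    : " + st.getModificationTime()); // epoch millis
    System.out.println("replication : " + st.getReplication());     // replicas per block
    System.out.println("block size  : " + st.getBlockSize());       // bytes per block
  }
}
```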
31. HDFS Inside Story
▪ The file content is split into large blocks (typically 128 megabytes),
and each block of the file is independently replicated at multiple
DataNodes.
▪ The blocks are stored on the local file system on the DataNodes.
▪ The NameNode actively monitors the number of replicas of each block.
▪ When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block (see the sketch below for controlling the replica count).
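The target replica count can be changed per file; a minimal sketch, assuming the same hypothetical file as before. The NameNode then schedules replication (or replica deletion) on the DataNodes to match the new target.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/demo/hello.txt");    // hypothetical path
    boolean ok = fs.setReplication(file, (short) 2); // ask for 2 replicas instead of the cluster default
    System.out.println("replication change accepted: " + ok);
  }
}
```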
32. HDFS Inside Story (cont.)
▪ The NameNode sends instructions to the DataNodes by replying to the heartbeats sent by those DataNodes.
▪ The instructions include commands to:
▪ replicate blocks to other nodes.
▪ remove local block replicas.
▪ re-register and send an immediate block report.
▪ shut down the node.
34. MapReduce
Called “classic MapReduce” in the Hadoop ecosystem
35. History of MapReduce
▪ The actual origins of MapReduce are arguable, but the paper most often cited as the one that started us down this journey is “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat (2004).
▪ This paper described how Google split, processed, and aggregated their data set
of mind-boggling size.
▪ Shortly after the release of the paper, a free and open source software pioneer by
the name of Doug Cutting started working on a MapReduce implementation to
solve scalability in another project he was working on called Nutch.
▪ an effort to build an open source search engine
▪ Over time and with some investment by Yahoo!, Hadoop split out as its own
project and eventually became a top-level Apache Foundation project.
36. What Is It?
▪ MapReduce is a programming model and an associated implementation for processing
and generating large data sets.
▪ It was originally developed by Google and built on well-known principles in
parallel and distributed processing dating back several decades.
▪ MapReduce has since enjoyed widespread adoption via an open-source implementation called Hadoop, whose development was led by Yahoo! (the canonical WordCount example is sketched below).
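To make the programming model concrete, here is the canonical WordCount example, in the style of the Hadoop MapReduce tutorial: the map function emits (word, 1) for every word in its input, and the reduce function sums the counts for each word. It is shown as an illustrative sketch rather than as part of the original slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // map: (offset, line) -> (word, 1) for every word in the line
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer it = new StringTokenizer(value.toString());
      while (it.hasMoreTokens()) {
        word.set(it.nextToken());
        ctx.write(word, ONE);
      }
    }
  }

  // reduce: (word, [1, 1, ...]) -> (word, total count)
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      result.set(sum);
      ctx.write(key, result);
    }
  }
}
```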
37. What Is It? (cont.)
▪ Hadoop MapReduce is a software framework for easily writing distributed applications.
▪ Such applications process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
38. How MapReduce Works
▪ A MapReduce job usually splits the input data-set into independent chunks
which are processed by the map tasks in a completely parallel manner.
▪ The framework sorts the outputs of the maps, which are then input to the reduce
tasks.
▪ Typically both the input and the output of the job are stored in a file-system.
▪ The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks (a minimal job driver is sketched below).
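To show how a job is wired together, here is a hedged sketch of a driver for the WordCount classes above, using the standard org.apache.hadoop.mapreduce Job API; the input and output paths are taken from the command line and are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);  // map phase
    job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);   // reduce phase
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not exist yet
    System.exit(job.waitForCompletion(true) ? 0 : 1);       // block until the job finishes
  }
}
```

The framework handles everything the slide describes: splitting the input, scheduling and monitoring the map and reduce tasks, and sorting the map outputs before they reach the reducers.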
39. How MapReduce Works (cont.)
▪ Typically the compute nodes and the storage nodes are the same, that is, the
MapReduce framework and the Hadoop Distributed File System are running on the
same set of nodes.
▪ This configuration allows the framework to effectively schedule tasks on the
nodes where data is already present, resulting in very high aggregate
bandwidth across the cluster.
40. How MapReduce Works (cont.)
▪ There are two types of nodes that control the job execution process: a jobtracker
and a number of tasktrackers.
▪ The jobtracker coordinates all the jobs run on the system by scheduling tasks to
run on tasktrackers.
▪ Tasktrackers run tasks and send progress reports to the jobtracker, which keeps
a record of the overall progress of each job.
▪ If a task fails, the jobtracker can reschedule it on a different tasktracker.
41. How MapReduce Works (cont.)
▪ Hadoop does its best to run the map task on a node where the input data resides
in HDFS.
▪ This is called the data locality optimization since it doesn’t use valuable cluster
bandwidth.
▪ Map tasks write their output to the local disk, not to HDFS. Why is this?
▪ Map output is intermediate output: it’s processed by reduce tasks to produce the final
output, and once the job is complete the map output can be thrown away.
▪ Storing it in HDFS, with replication, would be overkill!
42. How MapReduce Works (cont.)
▪ Hadoop divides the input to a
MapReduce job into fixed-size pieces
called input splits, or just splits.
▪ Hadoop creates one map task for each
split, which runs the user defined
map function for each record in the
split.
▪ Having many splits means the time taken to process each split is small compared to the time to process the whole input (split sizes can be tuned, as sketched below).
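By default one input split corresponds to one HDFS block, but the split size (and therefore the number of map tasks) can be tuned per job. A hedged sketch, assuming a FileInputFormat-based job like the driver above; the helper name and the bounds are hypothetical choices:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitTuning {
  // Constrain split sizes so each map task gets between 64 MB and 256 MB of input.
  static void tune(Job job) {
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}
```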
45. YARN?
▪ For very large clusters in the region of 4000 nodes and higher, the MapReduce
system described previously begins to hit scalability bottlenecks.
▪ In 2010 a group at Yahoo! began to design the next generation of MapReduce.
▪ The result was YARN, short for Yet Another Resource Negotiator.
46. YARN
▪ YARN addresses the scalability shortcomings of “classic” MapReduce by splitting the responsibilities of the jobtracker into separate entities.
▪ In classic MapReduce, the jobtracker takes care of both:
▪ job scheduling: matching tasks with tasktrackers
▪ task progress monitoring: keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping such as maintaining counter totals
47. YARN
▪ YARN separates these two roles into two independent daemons:
▪ A resource manager to manage the use of resources across the cluster.
▪ An application master to manage the lifecycle of applications running on the cluster.
▪ The idea is that an application master negotiates with the resource manager for cluster resources:
▪ described in terms of a number of containers, each with a certain memory limit
▪ and runs application-specific processes in those containers.
▪ The containers are overseen by node managers running on cluster nodes, which ensure that the application does not use more resources than it has been allocated.
54. What can we do with Hadoop?
▪ http://hortonworks.com/industry/
55. Hands-On Exercise
56. References
1. www.hortonworks.com
2. www.hadoop.apache.org
3. www.cloudera.com
4. Tom White, Hadoop: The Definitive Guide, Third Edition, O’Reilly Media.
5. A comprehensive tutorial on running a Hadoop cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
Editor's Notes
An inode is a data structure on a filesystem on Linux and other Unix-like operating systems that stores all the information about a file except its name and its actual data. A data structure is a way of storing data so that it can be used efficiently
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. A combiner combines key/value pairs with the same key. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers is grouped by the key, split among reducers, and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a Partitioner for the data split.
The Partitioner decides which reducer will get a particular key/value pair (a sketch follows below).
A reducer obtains key/[value list] pairs, sorted by the key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input pair.
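To illustrate, a hedged sketch of a custom Partitioner; the class name and the two-reducer split are hypothetical, and the job would also need job.setPartitionerClass(AlphabetPartitioner.class) and job.setNumReduceTasks(2).

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route keys starting with 'a'-'m' to reducer 0 and the rest to reducer 1.
// Assumes non-empty keys, as produced by the WordCount mapper above.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    char first = Character.toLowerCase(key.toString().charAt(0));
    int p = (first >= 'a' && first <= 'm') ? 0 : 1;
    return p % numPartitions; // stay in range even if fewer reducers are configured
  }
}
```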