Harinderjit Kaur
M.Tech(CSE)
PIT KAPURHALA
What is the need for Big Data technology when we already have robust, high-performing relational database management systems?
RDBMS
• Data is stored in a structured format: primary keys (PK), rows, columns, tuples, and foreign keys (FK).
• It was designed mainly for transactional data analysis.
• Data warehouses were later used for offline (historical) data.
• With the massive growth of the Internet and social networking (Facebook, LinkedIn), data became less structured.
What is Big Data?
• ‘Big Data’ is similar to ‘small data’, but bigger.
• Datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, and analytics.
3 V's of Big Data
• Volume: data quantity
• Velocity: data speed
• Variety: data types
Hadoop History
• In 2003, Doug Cutting was building Nutch, an open-source "Google": a web crawler plus an indexer.
• Crawling and indexing at web scale proved difficult: a massive storage and processing problem.
• In 2003 Google published its GFS paper, and in 2004 its MapReduce paper.
• Based on Google's papers, Doug redesigned Nutch's storage and processing layers, and the result was delivered in 2006 as Hadoop.
What is Hadoop?
• A framework of tools
• Open source, maintained by the Apache Software Foundation under the Apache License
• Supports running applications on Big Data
• Addresses the Big Data challenges: Volume, Velocity, Variety
What is Hadoop?
• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers
  - Large datasets → terabytes or petabytes of data
  - Large clusters → hundreds or thousands of nodes
• Hadoop is an open-source implementation of Google's MapReduce
• Hadoop is based on a simple programming model called MapReduce
• It is an open-source software framework written in Java
Hadoop makes it easier to store, process, and analyze large amounts of data on commodity hardware!
Apache Hadoop
• Developer(s): Apache Software Foundation
• Initial release: December 10, 2011
• Stable release: 2.6.0 / November 18, 2014
• Development status: Active
• Written in: Java
• Operating system: Cross-platform
• Type: Distributed file system
• License: Apache License 2.0
• Website: hadoop.apache.org
Characteristics of Hadoop
• Scalable
  A cluster can be expanded by adding new servers or resources without having to move, reformat, or change the dependent analytic workflows or applications.
• Cost effective
  Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
• Flexible
  Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis than any one system can provide.
• Fault tolerant
  When you lose a node, the system redirects work to another copy of the data and continues processing without missing a beat.
Hadoop Master/Slave Architecture
• Hadoop is designed as a master-slave, shared-nothing architecture
• A single master node coordinates many slave nodes
Hadoop Components
• HDFS (Storage): self-healing, high-bandwidth clustered storage
• MapReduce (Processing): fault-tolerant distributed processing
HDFS Basics
• HDFS (Hadoop Distributed File System) is a file system written in Java
• Sits on top of a native file system
• Provides redundant storage for massive amounts of data
Main Properties of HDFS
• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data
• Replication: each data block is replicated many times (default is 3)
• Failure: failure is the norm rather than the exception
• Fault Tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS
• The NameNode constantly checks on the DataNodes
Hadoop Distributed File System (HDFS)
• Centralized NameNode: maintains metadata about files
• Many DataNodes (1000s): store the actual data
  - Files are divided into blocks
  - Each block is replicated N times (default = 3)
[Figure: a file F split into five 64 MB blocks, distributed and replicated across DataNodes]
HDFS Data
• Data is split into blocks and stored on multiple nodes in the cluster
• Each block is usually 64 MB or 128 MB
• Each block is replicated multiple times, with replicas stored on different DataNodes
• HDFS is intended for large files (100 MB+)
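A minimal sketch of how these block-size and replication settings surface through the HDFS Java API is shown below; the file path, the 128 MB block size, and the 3-way replication are illustrative assumptions, not values taken from the slides.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: write a file to HDFS with an explicit block size and
// replication factor, then read the replication back from its metadata.
// The path and the 128 MB / 3-replica values are illustrative assumptions.
public class HdfsBlockDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.blocksize", "134217728");   // 128 MB blocks
        conf.set("dfs.replication", "3");         // default replication factor

        FileSystem fs = FileSystem.get(conf);              // cluster from core-site.xml
        Path file = new Path("/user/demo/sample.txt");     // hypothetical path

        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("hello hdfs");                    // HDFS splits data into blocks behind the scenes
        }

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Block size:  " + status.getBlockSize());
        System.out.println("Replication: " + status.getReplication());
    }
}
```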
2 Kinds of Nodes
Master Nodes Slave Nodes
Master Nodes
• NameNode
  - only 1 per cluster
  - metadata server and database
• JobTracker
  - only 1 per cluster
  - job scheduler
Slave Nodes
• DataNodes
  - 1-4000 per cluster
  - block data storage
• TaskTrackers
  - 1-4000 per cluster
  - task execution
NameNode
• A single NameNode stores all metadata
• Filenames, locations on DataNodes of each block, owner, group, etc.
• All information is maintained in RAM for fast lookup
• File system metadata size is therefore limited to the amount of RAM available on the NameNode
DataNode
• DataNodes store file contents
• Different blocks of the same file are stored on different DataNodes
• The same block is stored on three (or more) DataNodes for redundancy
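As a hedged illustration of this block placement, the sketch below asks the NameNode (via the standard FileSystem API) which DataNodes hold each block of a file; the path is hypothetical and the output depends entirely on the cluster.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: list the DataNodes holding each block of a file.
// The file path is an illustrative assumption.
public class BlockLocationsDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (int i = 0; i < blocks.length; i++) {
            // Each block reports the hostnames of the DataNodes holding a replica.
            System.out.println("Block " + i + " on: " + String.join(", ", blocks[i].getHosts()));
        }
    }
}
```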
MapReduce
• Programming model used by Google
• Input: a set of key/value pairs
• The user supplies two functions:
  - map(k, v) → list(k1, v1)
  - reduce(k1, list(v1)) → v2
• Map
  - Processes a key/value pair to generate intermediate key/value pairs
• Reduce
  - Merges all intermediate values associated with the same key
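To make the two-function contract above concrete, here is a small, self-contained Java sketch that runs map and reduce in a single process to count words. It only illustrates the model; a sketch of the actual Hadoop API version appears with the WordCount example later.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-process sketch of the MapReduce contract:
//   map(k, v)            -> list(k1, v1)
//   reduce(k1, list(v1)) -> v2
// Word count is used as the example; no Hadoop classes are involved.
public class MapReduceModelSketch {

    // map: (lineNumber, lineText) -> list of (word, 1)
    static List<Map.Entry<String, Integer>> map(long key, String value) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : value.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // reduce: (word, list of counts) -> total count
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        String[] lines = { "the quick brown fox", "the lazy dog", "the fox" };

        // Shuffle phase: group every intermediate value by its key.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (int i = 0; i < lines.length; i++) {
            for (Map.Entry<String, Integer> pair : map(i, lines[i])) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }

        // Reduce phase: one call per distinct key.
        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(word, counts)));
    }
}
```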
MapReduce
[Figure: a MapReduce job submitted by a client computer goes to the JobTracker on the master node; the JobTracker hands task instances to TaskTrackers running on the slave nodes]
Properties of MapReduce Engine
• The JobTracker is the master node (runs alongside the NameNode)
  - Receives the user's job
  - Decides how many tasks will run (the number of mappers)
  - Example: a file with 5 blocks → run 5 map tasks, spread across the nodes holding those blocks
Properties of MapReduce Engine (Cont'd)
• The TaskTracker is the slave node (runs on each DataNode)
  - Receives tasks from the JobTracker
  - Runs each task to completion (either a map or a reduce task)
  - Stays in constant communication with the JobTracker, reporting progress
[Figure: in this example, one MapReduce job consists of 4 map tasks whose output is routed by a parse/hash step to 3 reduce tasks]
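The parse/hash step in the figure decides which reduce task receives each intermediate key. The sketch below mirrors the behaviour of Hadoop's default HashPartitioner (hash of the key modulo the number of reducers); it is an illustration, not code from the slides.

```java
// Sketch of how the "parse-hash" step routes each intermediate key to one of
// the reduce tasks, mirroring Hadoop's default HashPartitioner:
//   partition = hash(key) mod numReduceTasks
public class HashPartitionSketch {

    static int partitionFor(String key, int numReduceTasks) {
        // Mask off the sign bit so the result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;   // matches the 3 reduce tasks in the example
        for (String key : new String[] { "the", "quick", "brown", "fox" }) {
            System.out.println("key '" + key + "' -> reducer " + partitionFor(key, reducers));
        }
    }
}
```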
How Map and Reduce Work Together
• Map emits intermediate information
• Reduce accepts that information
• Reduce applies a user-defined function to reduce the amount of data
MapReduce Example - WordCount
Lifecycle of a MapReduce Job
• Write a Map function and a Reduce function
• Run the program as a MapReduce job (see the sketch below)
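The original WordCount listing did not survive extraction, so the sketch below reconstructs the standard Hadoop WordCount example (a Mapper, a Reducer, and a driver that submits the job) using the org.apache.hadoop.mapreduce API. It is a reconstruction of the canonical example, not the author's original code.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: (offset, line of text) -> (word, 1) for every word in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: (word, [1, 1, ...]) -> (word, total count).
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: configure the job and submit it to the cluster.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A typical run loads the input into HDFS, submits the compiled jar with the `hadoop jar` command, and reads the results back from the output directory, which is exactly the workflow summarized on the next slide.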
Hadoop Workflow
[Figure: the developer ("You") interacts with the Hadoop cluster as follows]
1. Load data into HDFS
2. Develop code locally
3. Submit the MapReduce job
   3a. Go back to step 2 as needed
4. Retrieve the results from HDFS
Thank you