APACHE HADOOP
Miraj Godha
April 22, 2014
MIRAJ GODHA
LEARN ANYTHING FROM ANYWHERE.
We are one of the fastest-growing online destinations for
instructor-led live online courses.
Every one of our courses is written by experts in their respective
fields and crafted to help you grow and advance your career.
We do our best to connect the material to real-life examples and
real business practices.
Learn and apply it to your work.
We bring you the most cutting-edge and industry-relevant
courses.
COURSE DETAILS
The Motivation for Hadoop
Hadoop: Basic Concepts
Writing a MapReduce Program
Common MapReduce Algorithms
PIG Concepts
Hive Concepts
Working with Sqoop
OOZIE Concepts
HUE Concepts
Data Visualization & Analytics
Final Project
APACHE HADOOP
THE MOTIVATION FOR
HADOOP
Miraj Godha
April 22, 2014
WHERE DOES DATA COME FROM?
MACHINE-GENERATED AND
HISTORICAL DATA
THREE V’S OF BIG DATA
Volume
Velocity
Variety
VOLUME: THE AMOUNT OF DATA
~3 ZB of data exist in the digital universe today.
>300 TB of data in the U.S. Library of Congress.
Facebook has 30+ PB.
~2.5 PB of data in a data warehouse (DWH).
10+ PB DWH size.
VELOCITY: HOW RAPIDLY DATA IS GROWING
48 hours of new video uploaded to YouTube every minute.
571 new websites created every minute.
500+ TB of new data into Facebook every day.
175 million tweets every day.
1+ million customer transactions every hour.
Data production will be 44 times greater in 2020 than it was in 2009.
VARIETY: THE MANY FORMS OF DATA
Structured
•Traditional databases
•Numeric data
Semi-structured
•JSON
•XML
Unstructured
•Text documents
•Email
•Video
•Audio
•Machine-generated data
HOW COMPANIES ARE MINTING
MONEY FROM BIG DATA
Predict exactly what customers want before they ask for it
Marketing Campaign
Improve customer service
Fraud Detection
Get customers excited about their own data
Identify customer pain points and solve them
Reduce health care costs and improve treatment
Social Graph Analysis & Sentiment Analysis
Research and development
HOW SOME BIG COMPANIES USE
DATA FOR DIFFERENT KINDS OF
BUSINESS ANALYSIS
BIG DATA MARKET FORECAST
CAREER OPTIONS
HADOOP & HIVE HISTORY
Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
July 2005 – Nutch uses MapReduce
Feb 2006 – Hadoop becomes a Lucene subproject
Apr 2007 – Yahoo! runs Hadoop on a 1,000-node cluster
Jan 2008 – Hadoop becomes an Apache top-level project
Jul 2008 – A 4,000-node test cluster
Sept 2008 – Hive becomes a Hadoop subproject
PROBLEMS WITH
CURRENT SYSTEMS
1 machine:
• Read 1 TB of data
• 4 I/O channels
• ~100 MB/s per channel
→ ~45 minutes
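The ~45 minutes is simple transfer-time arithmetic (assuming the four channels stream in parallel):

1 TB ÷ (4 × 100 MB/s) = 10^12 B ÷ (4 × 10^8 B/s) = 2,500 s ≈ 42 min ≈ 45 min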
APACHE HADOOP WINS THE TERABYTE SORT BENCHMARK
(JULY 2008)
Yahoo! sorted 1 TB of data in 209 seconds,
beating the previous record of 297 seconds.
The sort used 1,800 map tasks and 1,800 reduce tasks.
Cluster configuration used for the benchmark sort:
 910 nodes
 2 quad-core Xeons @ 2.0 GHz per node
 8 GB RAM per node
WHY HADOOP?
1 machine: read 1 TB over 4 I/O channels at ~100 MB/s each → ~45 minutes
10 machines: the same 1 TB spread across 10 such machines → ~4.5 minutes
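With the data spread evenly, each machine reads only a tenth of it, in parallel with the others, so the time divides by the number of machines: 2,500 s ÷ 10 = 250 s ≈ 4.5 min. This is the scaling argument Hadoop is built on: add machines and move the computation to the data, and read throughput grows linearly.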
DISTRIBUTED FILE SYSTEM
(DFS)
(Diagram: a single namespace, \\MirajGodha.global.in, exposes logical paths that map to physical locations.)
\\MirajGodha\project  → \\MirajGodha.global.in\home\project
\\MirajGodha\images   → \\MirajGodha.global.in\home\images
\\MirajGodha\software → \\MirajGodha.global.in\home\software
\\MirajGodha\websites → \\MirajGodha.global.in\home\websites
WHO USES HADOOP?
(Company logos in the original slide.)
Yahoo!: 42,000 nodes as of July 2011
4,100 nodes
1,400 nodes
WHAT IS HADOOP?
Hadoop is a framework for the distributed processing of large datasets
across large clusters of commodity computers using a simple
programming model.
 Large datasets  terabytes or petabytes of data
 Large clusters  hundreds or thousands of nodes
Hadoop is an open-source implementation of Google's MapReduce.
Hadoop is based on a simple programming model called MapReduce.
Hadoop is based on a simple data model: any data will fit.
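The programming model is easiest to see in code. Below is a minimal word-count job in Hadoop's Java MapReduce API, the standard introductory example (a sketch; class and path names are illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every word in the input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // Reduce: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The map phase emits (word, 1) pairs, Hadoop groups them by key, and the reduce phase sums the counts; the same reducer class doubles as a combiner.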
WHAT MAKES IT ESPECIALLY
USEFUL
 Scalable: it can reliably store and process petabytes of data.
 Economical: it distributes the data and processing across clusters of commonly
available computers (thousands of nodes).
 Efficient: by distributing the data, it can process it in parallel on the nodes where
the data is located.
 Reliable: it automatically maintains multiple copies of the data and automatically
redeploys computing tasks after failures.
HADOOP: ASSUMPTIONS
Hardware will fail.
Applications need a write-once, read-many access model.
Data transfer (I/O) is the bottleneck.
Very large distributed file system
– 10K nodes, 100 million files, 10 PB
Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them
Move the logic to the data rather than the data to the logic.
HDFS ARCHITECTURE
(Diagram: the client talks to the NameNode for metadata and to the DataNodes for data; a Secondary NameNode sits alongside the NameNode.)
NameNode: contains information about the data (the metadata)
DataNode: contains the physical data
Secondary NameNode: periodically reads the metadata from the NameNode and merges the edit log into the fsimage (it is not a hot standby)
DISTRIBUTED FILE SYSTEM
Single Namespace for entire cluster
Data Coherency
– Write-once-read-many access model
– Client can only append to existing files
Files are broken up into blocks
– Typically 64 MB block size
– Each block replicated on multiple DataNodes
Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode
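The "intelligent client" behavior is visible in HDFS's public Java API: the client can ask the NameNode where each block lives and then talk to those DataNodes directly. A minimal sketch (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();        // picks up the cluster config
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/user/demo/bigfile.dat");  // hypothetical file
    FileStatus status = fs.getFileStatus(file);
    // Ask the NameNode which DataNodes hold each block of the file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (int i = 0; i < blocks.length; i++) {
      System.out.printf("block %d at offset %d on hosts %s%n",
          i, blocks[i].getOffset(), String.join(",", blocks[i].getHosts()));
    }
    fs.close();
  }
}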
COMPLEX QUERIES: HADOOP
COMPARED WITH
TRADITIONAL DATABASES
WHICH HADOOP
DISTRIBUTION?
Apache/Open Source
•Hortonworks
 Pros: 100% open-source version; integration/services focused; extensive partner network.
 Cons: slower interactive queries.
•Cloudera
 Pros: widely used distribution; faster interactive queries; extensive tooling.
 Cons: proprietary extensions such as Impala; commercial version only.
•MapR
 Pros: enterprise- and production-ready focus; works with NFS and native Unix commands.
 Cons: less focused on new Hadoop features such as YARN.
Proprietary
•PivotalHD
 Pros: faster interactive query support with Greenplum; integrates with the Cloud Foundry PaaS platform.
 Cons: proprietary extensions; not easy to decouple.
•IBM
 Pros: offers open source without a branched version; integrated with PaaS and IBM tools.
 Cons: limited releases; expensive; may not be easy to decouple.
(Diagram: a file F is divided into 64 MB blocks, numbered 1–5; each block is stored as three replicas spread across the twelve disks of Rack 1, Rack 2, and Rack 3.)
BLOCK PLACEMENT
Current Strategy
-- One replica on the local node
-- Second replica on a remote rack
-- Third replica on the same remote rack as the second
-- Additional replicas are randomly placed
Clients read from nearest replica
MAIN PROPERTIES OF HDFS
Large: an HDFS instance may consist of thousands of server machines, each
storing part of the file system’s data
Replication: each data block is replicated many times (default is 3)
Failure: failure is the norm rather than the exception
Fault tolerance: detection of faults and quick, automatic recovery from
them is a core architectural goal of HDFS
 DataNodes send heartbeats to the NameNode
NAMENODE METADATA
Metadata is kept in memory.
Types of metadata
– List of files
– List of blocks for each file
– List of DataNodes for each block
– File attributes, e.g. creation time, replication factor
A transaction log
– Records file creations, file deletions, etc.
HA CLUSTER
Two separate machines are configured as NameNodes.
At any point in time, exactly one of the NameNodes is in an Active state, and the other is in
a Standby state.
The Active NameNode is responsible for all client operations in the cluster
In order for the Standby node to keep its state synchronized with the Active node, both nodes
communicate with a group of separate daemons called "JournalNodes" (JNs).
When any namespace modification is performed by the Active node, it durably logs a record of the
modification to a majority of these JNs.
The Standby node is capable of reading the edits from the JNs, and is constantly watching them for
changes to the edit log.
In the event of a failover, the Standby will ensure that it has read all of the edits from the JournalNodes
before promoting itself to the Active state.
In order to provide a fast failover, it is also necessary that the Standby node have up-to-date
information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are
configured with the location of both NameNodes, and send block location information and heartbeats
to both.
During a failover, the NameNode which is to become active will simply take over the role of writing to
the JournalNodes.
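The "majority of these JNs" rule is a quorum write. Below is a small illustrative Java model of that rule, not Hadoop's actual QuorumJournalManager code (RemoteJournal is a made-up stand-in for the JournalNode RPC interface): an edit counts as durable once more than half of the JournalNodes acknowledge it, so a minority of slow or failed JNs can neither block nor lose edits.

import java.util.List;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class QuorumEditLog {
  interface RemoteJournal { boolean append(long txId, byte[] edit); } // hypothetical JN stub

  private final List<RemoteJournal> journals;
  private final ExecutorService pool = Executors.newCachedThreadPool();

  public QuorumEditLog(List<RemoteJournal> journals) { this.journals = journals; }

  /** Durably log one edit: returns true once a majority of JNs have acked. */
  public boolean logEdit(long txId, byte[] edit) throws InterruptedException {
    int needed = journals.size() / 2 + 1;              // majority quorum
    CompletionService<Boolean> cs = new ExecutorCompletionService<>(pool);
    for (RemoteJournal jn : journals) {
      cs.submit(() -> jn.append(txId, edit));          // write to every JN in parallel
    }
    int acks = 0, replies = 0;
    while (replies < journals.size() && acks < needed) {
      try {
        if (cs.take().get()) acks++;                   // count successful acks
      } catch (ExecutionException e) {
        // a JN failed or timed out; the quorum tolerates a minority of failures
      }
      replies++;
    }
    return acks >= needed;
  }
}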
DATANODE
A Block Server
– Stores data in the local file system
– Stores meta-data of a block
– Serves data to Clients
Block Report
– Periodically sends a report of all existing blocks to the
NameNode
Facilitates Pipelining of Data
– Forwards data to other specified DataNodes
HADOOP MASTER/SLAVE
ARCHITECTURE
Hadoop is designed as a master-slave, shared-nothing architecture:
a single master node and many slave nodes.
JOB SUBMISSION
(Diagram) The user submits the job to the job submitter. The job submitter gets a new job ID from the JobTracker, computes the input splits, copies the job resources (JAR file, configuration file, computed input splits) into a jobID directory in the DFS, and then submits the job to the JobTracker.
JOB TRACKER
(Diagram) The job submitter puts the job files (job.xml, job.jar) into the DFS and submits the job to the JobTracker. The JobTracker reads the job files and split information from the DFS, creates the map and reduce tasks (number of map tasks = number of input splits), and places them in its internal job queue.
TASK ASSIGNMENT
(Diagram) The JobTracker picks a job from the job queue and initializes it. Each TaskTracker sends a periodic heartbeat to the JobTracker, and the JobTracker assigns it tasks from the job in the heartbeat response.
TASK EXECUTION
(Diagram) The JobTracker assigns a task to the TaskTracker via the heartbeat. The TaskTracker reads job.xml and job.jar from the DFS onto its local disk and launches the task in a separate JVM; it can run several JVMs in parallel.
JOBTRACKER
The master node runs a JobTracker instance, which accepts job requests
from clients.
There is only one JobTracker daemon running per Hadoop cluster.
The JobTracker
– determines the execution plan by determining which files to process,
– assigns nodes to the different tasks,
– monitors all tasks as they are running.
TASKTRACKER
Manages the execution of individual tasks on a data node.
There is one TaskTracker per data node.
Each TaskTracker can spawn multiple JVMs to handle many map or
reduce tasks in parallel.
TaskTrackers constantly communicate with the JobTracker.
If the JobTracker fails to receive a heartbeat from a TaskTracker within a
specified amount of time, it assumes the TaskTracker has crashed and
resubmits that TaskTracker’s tasks to another TaskTracker.
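A minimal sketch of that timeout rule in Java (illustrative only; the 10-minute value is an assumption standing in for Hadoop's configurable expiry interval):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10 * 60 * 1000; // assumed expiry interval
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  /** Called whenever a TaskTracker heartbeat arrives. */
  public void onHeartbeat(String trackerId) {
    lastHeartbeat.put(trackerId, System.currentTimeMillis());
  }

  /** Periodic sweep: any tracker silent for too long is presumed dead. */
  public void sweep() {
    long now = System.currentTimeMillis();
    for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
      if (now - e.getValue() > TIMEOUT_MS) {
        lastHeartbeat.remove(e.getKey());
        resubmitTasksOf(e.getKey()); // hand the tracker's tasks to another node
      }
    }
  }

  private void resubmitTasksOf(String trackerId) {
    System.out.println("TaskTracker " + trackerId
        + " presumed crashed; resubmitting its tasks");
  }
}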
HEARTBEATS
DataNodes send heartbeats to the NameNode.
The NameNode uses heartbeats to detect DataNode failure.
REPLICATION ENGINE
NameNode detects DataNode failures
 Chooses new DataNodes for new replicas
 Balances disk usage
 Balances communication traffic to DataNodes
DATA PIPELINE & WRITE ANATOMY
(Diagram) The HDFS client asks the NameNode to add a block, then writes the block to the first DataNode; each DataNode forwards the data to the next in the pipeline. Acknowledgements flow back along the pipeline to the client, which reports "complete" to the NameNode.
DATA PIPELINING
Client retrieves a list of DataNodes on which to place replicas of
a block
Client writes block to the first DataNode
The first DataNode forwards the data to the next DataNode in
the Pipeline
When all replicas are written, the client moves on to write the
next block of the file.
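From the client's point of view, all of this pipelining is hidden behind an ordinary output stream. A minimal write sketch against the HDFS Java API (the path is made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() obtains blocks from the NameNode; the stream writes
    // through the DataNode pipeline behind the scenes.
    try (FSDataOutputStream out = fs.create(new Path("/user/demo/hello.txt"))) {
      out.writeBytes("hello, HDFS\n");
    }
    fs.close();
  }
}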
READ ANATOMY
(Diagram) The HDFS client asks the NameNode for a file's block locations ("get block"), then reads the data directly from the DataNodes that hold the replicas.
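The corresponding client-side read, again a minimal sketch against the public API (hypothetical path):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // open() fetches block locations from the NameNode;
    // the actual reads go straight to the DataNodes.
    try (FSDataInputStream in = fs.open(new Path("/user/demo/hello.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false); // stream contents to stdout
    }
    fs.close();
  }
}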
DATA CORRECTNESS
Checksums are used to validate the data
– CRC32 is used
File creation
– The client computes a checksum per 512 bytes
– The DataNode stores the checksums
File access
– The client retrieves the data and checksums from the DataNode
– If validation fails, the client tries other replicas
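The scheme is easy to reproduce with the JDK's built-in CRC32 (a self-contained illustration of per-512-byte checksumming, not HDFS's actual code):

import java.util.Arrays;
import java.util.zip.CRC32;

public class ChunkChecksums {
  /** Compute one CRC32 per chunk, as HDFS does per 512-byte unit. */
  public static long[] crcPerChunk(byte[] data, int chunkSize) {
    int chunks = (data.length + chunkSize - 1) / chunkSize;
    long[] crcs = new long[chunks];
    CRC32 crc = new CRC32();
    for (int i = 0; i < chunks; i++) {
      crc.reset();
      int off = i * chunkSize;
      crc.update(data, off, Math.min(chunkSize, data.length - off));
      crcs[i] = crc.getValue();
    }
    return crcs;
  }

  public static void main(String[] args) {
    byte[] data = new byte[1300]; // ~2.5 chunks of 512 bytes
    long[] stored = crcPerChunk(data, 512);     // computed at write time
    long[] recomputed = crcPerChunk(data, 512); // recomputed at read time
    // A mismatch means the replica is corrupt: the client tries another replica.
    System.out.println(Arrays.equals(stored, recomputed)
        ? "checksums match" : "corruption detected: try another replica");
  }
}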