4. December 1, 2017 4
Types of Data
Data is classified into 3 types:
•Structured Data
•Unstructured Data
•Semi-structured Data
Structured Data
•Fits into the world of RDBMS
•Data is perfectly aligned in rows and columns
•A tabular format is used for representing data
Example: Data stored in a MySQL database
Introduction to BigData
Unstructured Data
•No definite structure can be assigned to this data
•Cannot tabulate the data
•Cannot put in rows and columns
•Cannot be fitted into any schema
Example: Text files, PDF documents, Web server logs, Photos, Voice
Semi-structured Data
•Data which lies between structured and unstructured
•Unstructured data embedded within some structure, tags or schema
Example: XML file
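The three types can be contrasted with a small sketch. The sample records, names and the helper `xml_fields` below are made up for illustration and do not come from the slides.

```python
import xml.etree.ElementTree as ET

# Structured: every record has the same columns, like rows in an RDBMS table.
structured_rows = [
    ("id", "name", "city"),
    (1, "Asha", "Pune"),
    (2, "Ravi", "Delhi"),
]

# Semi-structured: tags give some structure, but records may differ in shape.
semi_structured = (
    "<users>"
    "<user id='1'><name>Asha</name></user>"
    "<user id='2'><name>Ravi</name><phone>555-0100</phone></user>"
    "</users>"
)

# Unstructured: free text with no schema at all.
unstructured = "Asha called support on Monday and asked about her invoice."

def xml_fields(xml_text):
    """Return, for each <user> element, the sorted list of its child tag names."""
    root = ET.fromstring(xml_text)
    return [sorted(child.tag for child in user) for user in root.findall("user")]

# The two users do not share the same fields, unlike rows in a table.
fields_per_user = xml_fields(semi_structured)
```

Here `fields_per_user` shows that the second user carries an extra `phone` field, which a fixed table schema would not allow without altering the table.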
Big Data Sources:
•New York Stock Exchange generates 4 to 5 TB of data per day
•Internet stores 18.5 petabytes of data per day
•Twitter handles 12 TB of tweets every day
•Facebook :
1.5 billion active users monthly
300 PB of user data
10 billion messages per day
Big Data – The buzz word
•80 to 90% of data is unstructured and cannot be fitted into RDBMS based systems.
•Big Data is Efficient, Economical and Quicker
Characteristics of BigData:
(3 V’s is now 6 V’s)
6V’s
Volume
Velocity
Variety
Veracity
Value
Variability
Volume: Sheer bulk of data being generated.
Velocity: Rate at which data is generated
Ex: 5 to 10 TB of data is uploaded on YouTube every 10 mins.
Variety: 80 to 90% of Big Data is unstructured and semi-structured
Ex: Text data, Voice data, Sensor data etc…
Veracity: Uncertainty about the correctness of data
Ex: Data collected from sensors
Variability: Inconsistencies in the rate at which data is generated
Value: What is the value proposition?
What business value does it create?
The output should be worth the investments made to analyze data.
Use Cases of Big Data
Financial Services
•Fraud detection
•Personalized banking services
Health Care
•Analysis was previously restricted to the major players due to expensive tools and technologies
•Since the advent of big data, the technology is no longer very expensive
•Many players in the health care segment will start working on big data
Retail Industry
•Oldest consumers of big data
•Were using data warehousing techniques very heavily
•Now slowly shifting to big data technology
Web and Social Media Analytics
•Most recent entrant
•Heavily looking into big data related research work for behavioral and social analytics
RDBMS and BIG DATA
Benefits of RDBMS:
•Compatibility
• Flexibility
•Simplicity
•Performance
•Robustness
Well known RDBMS
•Oracle
•Microsoft SQL server
•MySQL
•Teradata
•DB2
RDBMS:
•Normalization
Data consistency
Eliminates data duplication
•Relational databases have to be incredibly complex internally
Example: A simple select statement could have hundreds of possible query execution paths
RDBMS determines the execution plan using cost based algorithms
Drawbacks of RDBMS:
•New demand is scalability
•New apps being launched: Massive load on storage and
scalability
•Supporting large number of concurrent users
•Dynamic support is needed
•Scaling:
A new application can go viral overnight, users increase from zero to a million
Some users are frequent, others never return
Seasonal swings can create spikes
Users need real-time high performance
Vertical scaling possible, not ready for horizontal scaling
Single server node
Today’s demand
Increased workload due to flexibility requirement
Database structures need to be altered
Example: Started selling televisions
•Database schema defined for television
•Added refrigerator and music system to catalogue, so the schema has to be altered
Introduction to Parallel computing:
Divide a task and conquer
[Figure: one task divided among Computer 1, Computer 2, Computer 3 and Computer 4]
• Reduced time by Parallel processing
• Faster and quicker
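The divide and conquer idea can be sketched in a few lines. This is only an illustration: threads on one machine stand in for the four computers, and the function names `split_task` and `parallel_sum` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_task(numbers, parts):
    """Divide one list into at most `parts` chunks of nearly equal size."""
    size = max(1, -(-len(numbers) // parts))  # ceiling division, at least 1
    return [numbers[i:i + size] for i in range(0, len(numbers), size)]

def parallel_sum(numbers, workers=4):
    """Sum each chunk on its own worker, then combine the partial results."""
    chunks = split_task(numbers, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

# One task (summing 1..100) divided among 4 workers.
total = parallel_sum(list(range(1, 101)))
```

Each worker only sees its own chunk, and a final step combines the partial results, which is the same shape that MapReduce later applies across machines.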
Challenges of Super Computing:
•General purpose operating system did not exist
•Buyers of super computers were locked to vendors for hardware support
•High initial cost of the hardware
•High cost of software maintenance and upgrades, to be taken care of in-house
•Custom software had to be developed for individual use cases
HADOOP – The rescue
• General purpose, operating system like framework
• Rich built-in feature set of software tools and components
• Not locked to one vendor, can be installed on any commodity hardware
• Mid sized organizations can afford it
• Free software (open source) with free upgrades
• Brings distributed computing to a wider audience
Hadoop History:
2000-2002
•Project NUTCH
•Open source, scalable and robust internet search engine
•Doug Cutting and Mike
2003-2004
•Google File System (GFS) paper
•Google MapReduce paper
•Common features with NUTCH
2006
•Doug Cutting joined Yahoo and created Hadoop
•NUTCH + Google File System + Google MapReduce
•Hadoop MapReduce implemented in Java
2008
•Hadoop: Apache project
•Stable version of Hadoop used in Yahoo
Overview of Hadoop Architecture
Basic components of Hadoop 1.0
•Name node
•Secondary name node
• Job tracker
•Task tracker
•Data nodes
Job in Hadoop Ecosystem
•A job is some task submitted by the user to the Hadoop cluster
•The job is in the form of a program or collection of programs (a JAR file)
which needs to be executed
Attributes:
•Programs
•Input data to the program, i.e. a file or collection of files in a directory
•Output directory where the results of execution are collected in files
Job
•Java MapReduce jobs.
•Programs are submitted into the cluster in the form of a JAR file.
•The JAR file packages all the classes.
•Programs need to be executed on a particular set of data.
Apache Hadoop Core Features
Hadoop Distributed File System(HDFS)
•When a file is submitted into the Hadoop cluster:
Original file is divided into smaller pieces
The pieces reside on multiple data nodes
Parallel Processing Framework
•Robust
•Known as a MapReduce Framework
•When you submit the job into the Hadoop cluster:
Program executes on a piece of data
Runs on multiple machines
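The idea that the same program executes on separate pieces of data can be sketched with a word count. This is a single-process illustration in plain Python, not Hadoop's Java MapReduce API; the function names `map_phase`, `shuffle` and `reduce_phase` and the sample text are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(piece):
    """Run on one piece of the input: emit a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in piece.split()]

def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}

# Two pieces of one file; in a cluster each piece would sit on a different data node.
pieces = ["big data is big", "data is everywhere"]
mapped = [pair for piece in pieces for pair in map_phase(piece)]
word_counts = reduce_phase(shuffle(mapped))
```

In a real cluster the map calls would run in parallel on the machines that hold each piece, and only the grouped intermediate pairs would travel over the network.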
Master Slave Architecture
•Cluster of 1000 nodes
•Three machines run the master services: name node, secondary name node and job tracker
•Remaining 997 machines work in the slave mode and act as data nodes
Name node, secondary name node and job tracker
•Need not have a very high hard drive storage space
Data nodes
•Need to be very high in terms of hard drive storage space
•Take the load of all the data
•Big data storage makes up the bulk of data operations
Master Slave Architecture: Host OS
•Hadoop, as a software framework, is installed on the native operating system
•It is installed on all the machines, the masters as well as the data nodes
•The differentiating factor is the software configuration after installing Hadoop
•Depending on that configuration, machines perform the responsibilities associated with the name node, secondary name node and job tracker
Hadoop Cluster Setup
•Machines within a rack communicate with the help of a switch at the speed of 10 gigabits per second
•Multiple racks communicate with the help of a multi layer switch or uplink switch which also acts as a router
•Data transfer speed between machines within the same rack is higher than the data transfer speed between machines across different racks
General Specifications of Hadoop Cluster
Built using Commodity Hardware
•Makes the hardware easy to procure and maintain
•Reduces dependency on just one vendor
Processor Build
•Most of the data nodes have two hex-core processors or two octa-core processors
•For example, 2 CPUs with 8 cores each
•Processing speed lies anywhere between 2.4 and 3.5 gigahertz
Memory and Storage
•Amount of RAM varies according to the organizational needs
•Name node and job tracker would have higher RAM
•Most of the data nodes range between 50 and 500 GB of high speed RAM
Thumb Rule
•Used to decide how much hard drive storage is needed for each data node
•Every single CPU core requires at least 2 terabytes of hard drive
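The thumb rule is simple arithmetic. The sketch below applies it to the two processor configurations mentioned earlier; the function name `disk_needed_tb` is hypothetical.

```python
def disk_needed_tb(cpus, cores_per_cpu, tb_per_core=2):
    """Minimum hard drive storage per data node under the 2 TB per core thumb rule."""
    return cpus * cores_per_cpu * tb_per_core

# Two octa-core processors: 16 cores in total.
octa_core_node_tb = disk_needed_tb(cpus=2, cores_per_cpu=8)

# Two hex-core processors: 12 cores in total.
hex_core_node_tb = disk_needed_tb(cpus=2, cores_per_cpu=6)
```

Under this rule a data node with two octa-core processors needs at least 32 TB of disk, and one with two hex-core processors at least 24 TB.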
Hadoop Services
1.Once Hadoop is installed, certain services are enabled
2.A machine on which the name node processes or services are running acquires the role of the name node
3.Similarly, a machine running the data node services acquires the role of a data node
4.These services are the set of Hadoop services, also called Hadoop daemons
5.Each one is a software process or a collection of several processes
Hadoop Deployment Modes
Three different modes in which Hadoop can be installed and deployed:
•Standalone Mode
•Pseudo Distributed Mode
•Fully Distributed Mode (Cluster Mode): Hadoop is installed on the cluster of interconnected machines
Hadoop Components: Standalone Mode
•Job tracker, name node, secondary name node and data node are the software services which actually run as a part of your Hadoop installation
•If all the services are sharing a single JVM, this mode is called the Standalone Mode
•If the JVM crashes, all Hadoop services will also crash
•This mode is mainly used for testing purposes
•It's the least preferred mode of Hadoop installation and deployment
Hadoop Components: Pseudo Distributed Mode
•Each of the Hadoop services runs on a separate JVM
•This mode is widely used for learning and development purposes but not for deployment
Pseudo Distributed Mode vs. Standalone Mode
•Similarity: both run on a single machine, and the services run as a part of the Hadoop installation
•Standalone Mode: when the JVM crashes, all the Hadoop services also crash
•Pseudo Distributed Mode: crashing of a JVM does not impact the Hadoop cluster
Hadoop Components: Fully Distributed Mode
• Used in real production environment
• If job tracker is configured on a machine, that dedicated machine runs only the job tracker service of the Hadoop installation
• Likewise, a dedicated machine works as the name node when that hardware runs only the name node services
Real VS Pseudo Distributed Mode
• In Real Distributed Mode, the Hadoop services are interconnected, but each runs on a separate JVM on a separate machine
• In Pseudo Distributed Mode, the Hadoop services run on different JVMs but on a single machine
Functionalities of Hadoop Components
•Name Node
All the information about the file system is available at the name node
It is a centralised file namespace server or a file system server
• Secondary Name Node
Helps the name node by periodically backing up the data present in the name node server
In the event of a name node failure, the secondary name node is used to recover and restore the name node
Hot Standby
•When the name node is down, the entire cluster goes down in the case of Hadoop 1.0
•Hadoop 1.0 did not have the provision of a hot standby name node
•This is not the case with Hadoop 2.0
•Hot standby means that the standby name node works in a UPS (uninterrupted power supply) mode
•As soon as power is down, a UPS starts backing up; in the same way, the standby takes over as soon as the name node is down
In Hadoop 2.0
•When a cluster is completely down, bring up the name node
•Copy all the backup files from the secondary name node
•Restore name node operations
Data Node Functionality
• Job is a collection of programs or a single program which is
going to be operating on a piece of data
•On each of the data nodes, a software service called the task
tracker runs continuously
•Data node stores the big data whenever a job is submitted
•Manipulates data before executing on the data node
•The decision of which program will be executed by which data
node is taken by job tracker
Job Submission and Execution
Job Submission and Execution in Hadoop cluster
How is a job submitted into the Hadoop cluster?
How exactly would the job get executed?
Imagine you are working as a data engineer or a data scientist in
your team and usually work on a desktop or a laptop
•Hadoop is installed in Pseudo Distributed Mode for all testing and development purposes
•The Java MapReduce source files are compiled and packaged into JAR files
•These JAR files are submitted into the cluster as a job
[Figure: Job Client/Gateway Machine communicating with the Name node and the Job tracker]
Job Client/Gateway Machine
•This job client is not exactly a part of the cluster
•Hadoop services are not running on it
•Configured to communicate with the name node and the job
tracker
•Job configuration details (.jar):
Input file path
Output file path
Name Node
•Job is picked up by a name node
•Provides information:
Blocks corresponding to the input files
Programs where work is residing
Job Tracker
• Schedules the jobs
• Distributes the job to the multiple data nodes on which the input file resides
•Result of execution is available in the output path
•User can check the status of the job
•The status update can be found using the job tracker:
What percentage of the job is currently completed
Information is available periodically
Basic HDFS
Agenda
• HDFS stands for Hadoop distributed file system
• File storage component of Hadoop
• Basic architecture of HDFS and Hadoop
• How HDFS stores the file internally
• Failure handling and recovery mechanism
• Rack awareness and block placement strategies
• Role of name node and secondary name node
• When to use HDFS and when not to
HDFS : Storage inside HDFS cluster
[Figure: input file (200 MB) split into Block 1, Block 2, Block 3, ..., Block N]
• HDFS breaks the user input file (200 MB) into smaller chunks
• Block size is configured by the administrator
• Default split size is 64 MB
• Split size can be configured depending upon the requirement
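The splitting of the 200 MB input file can be worked through as a small sketch. The function name `split_into_blocks` is hypothetical; the sizes are the ones from the slide.

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return the sizes (in MB) of the blocks a file is split into."""
    blocks = []
    remaining = file_size_mb
    while remaining > 0:
        # Every block is full except possibly the last one.
        blocks.append(min(block_size_mb, remaining))
        remaining -= block_size_mb
    return blocks

# The 200 MB input file with the default 64 MB split size.
blocks = split_into_blocks(200)

# The same file if the administrator configures a 128 MB split size.
larger_blocks = split_into_blocks(200, block_size_mb=128)
```

With the 64 MB default, the file becomes three full blocks and one final block of 8 MB; the last block only occupies as much space as the data it holds.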
File Storage in HDFS
•Client machine communicates with the name node
•Name node gives out the information about the default split size
•Client machine gets an idea of how big each input split will be
•Splitting of the files actually happens in the client machine
•The name node gives out the information about:
The hostnames or IP addresses of the data nodes
Free space available to actually store the data
•Client machine directly writes blocks on to data nodes
•The client machine or the gateway machine performs this by bypassing the name node
•The decision of which block resides on which data node is governed by a specific set of rules and is not made randomly
Overview of HDFS
Design and Architecture Overview
• Data node sends a heartbeat signal to the name node once in every 3 seconds to indicate that it is up and running
[Figure: HDFS client, name node, secondary name node (namespace backup) and data nodes (data serving); heartbeat sent every 3 seconds; nodes write to local disk]
• If the data node fails to send the signal once in every 3 seconds:
Name node assumes that particular data node is dead
Takes actions for replicating the data
[Figure: HDFS client, name node, secondary name node (namespace backup) and data nodes; heartbeat not received from one data node; nodes write to local disk]
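The heartbeat check can be sketched as a comparison of timestamps. This is a simplified model, not the name node's actual code: the function `dead_nodes`, the node names and the timestamps are hypothetical, and the timeout is set to one heartbeat interval only to mirror the slide. A real cluster typically tolerates several missed heartbeats before declaring a node dead.

```python
HEARTBEAT_INTERVAL_S = 3  # from the slide: one heartbeat every 3 seconds

def dead_nodes(last_heartbeat, now, timeout_s=HEARTBEAT_INTERVAL_S):
    """Return the data nodes whose last heartbeat is older than the timeout."""
    return sorted(
        node for node, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_s
    )

# Hypothetical timestamps (in seconds) of the last heartbeat from each data node.
last_heartbeat = {"dn1": 100.0, "dn2": 98.5, "dn3": 94.0}

# At t = 101 s only dn3 has been silent for more than 3 seconds.
dead = dead_nodes(last_heartbeat, now=101.0)
```

The name node would then schedule new copies of every block that was stored on the nodes in `dead`.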
•Data node also sends status information to the name node once in every 6 hours
•This value can be configured to a different number by the Hadoop Administrator
•This status information is the block status report of the data node
•It gives complete, detailed information about which blocks exist on that particular data node
Rack View of Hadoop Cluster
[Figure: name node, secondary name node and job tracker placed in different racks]
• A Hadoop cluster in a production environment is deployed across multiple racks
• The name node, the secondary name node, and job tracker are never placed in a single rack
• Otherwise, in the event of failure of that rack, the entire cluster would be down
Replication of Data Blocks
• To avoid loss of data, copies of each data block are stored on multiple data nodes
• Default replication factor is 3
• 3 copies of the same data block on 3 different data nodes
• Can be configured by administrator
• Replication factor should not be greater than 3 to avoid consuming a lot of hard drive space
Block placement Strategy
Block Placement Strategy/Replica Placement
• Two racks :
Rack 1 in the left hand side
Rack 2 in the right hand side
• First replica of the block is placed in one of the data nodes in the left hand side, i.e. rack 1
• The two other replicas of the same block are placed on different data nodes that share the same rack
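The placement rule can be sketched as follows. This example assumes that the rack shared by the two other replicas is the second rack, which is how the default HDFS policy is usually described; the slide itself only says they share a rack. The function `place_replicas`, the rack names and the node names are hypothetical, and nodes are picked in list order here, whereas a real cluster also considers load and free space.

```python
def place_replicas(racks, first_rack):
    """Pick 3 (rack, data node) targets: one on first_rack, two on another rack.

    racks maps a rack name to the list of data nodes in that rack.
    Assumption: the second and third replicas go to a different rack
    than the first one, on two different data nodes of that rack.
    """
    other_rack = next(name for name in racks if name != first_rack)
    first = (first_rack, racks[first_rack][0])
    second = (other_rack, racks[other_rack][0])
    third = (other_rack, racks[other_rack][1])
    return [first, second, third]

racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
}
placement = place_replicas(racks, first_rack="rack1")
```

Spreading replicas over two racks means the block survives the loss of a whole rack, while keeping two replicas in one rack limits the traffic that has to cross the slower link between racks.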
Data Replication on Failure
[Figure: data nodes DN 1 to DN 12 holding copies of blocks N1 to N4; system wide replication factor = 2]
• In the event of a data node failure:
The replication count goes down for the blocks stored on that data node
Replication factor for block N2 is reduced to 1
• HDFS replicates the block N2 into some other data node
• Example: N2 is replicated into data node 12
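Finding the blocks that need a new copy after a failure can be sketched like this. The block-to-node mapping below is hypothetical (the figure does not survive in enough detail to reconstruct it), and `under_replicated` is an illustrative helper, not an HDFS API.

```python
def under_replicated(block_locations, live_nodes, replication_factor=2):
    """Return {block: missing copies} for blocks with too few live replicas."""
    missing = {}
    for block, nodes in block_locations.items():
        live_copies = sum(1 for node in nodes if node in live_nodes)
        if live_copies < replication_factor:
            missing[block] = replication_factor - live_copies
    return missing

# Hypothetical layout with a system wide replication factor of 2.
block_locations = {
    "N1": ["DN1", "DN7"],
    "N2": ["DN3", "DN10"],
    "N3": ["DN4", "DN8"],
    "N4": ["DN5", "DN9"],
}
all_nodes = {f"DN{i}" for i in range(1, 13)}

# DN10 fails: block N2 is left with a single live copy.
to_replicate = under_replicated(block_locations, all_nodes - {"DN10"})
```

The name node would respond by copying N2 from its surviving replica to another data node, for example data node 12 as on the slide.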
Data Replication on Failure: What if?
[Figure: data nodes DN 1 to DN 12 holding copies of blocks N1 to N4; system wide replication factor = 2]
• Data node 10 was temporarily down
• Data node 10 comes up after some time, for example after 2 minutes
• The name node or HDFS has already replicated the block N2 into some other data node
• HDFS deletes one of the extra copies of N2; the deletion can happen on any of the nodes holding a copy
When to/not to use HDFS?
Use HDFS
•Storing large files, of the order of gigabytes, terabytes and petabytes
•Input file size is greater than the input split size
Do not use HDFS
•Storing large number of small files
•High I/O latency when the data is written to or read from disc is not acceptable
•Input file size is smaller than the input split size
WORM - write once, read many times pattern in HDFS
•Files cannot be edited/changed in place
•To change a file, it is retrieved back into the local file system and deleted from HDFS
•It is edited and then put back into HDFS
Architectural Overview of Hadoop 1.0
Name node (master node)
• One name node per cluster
• Manages the entire file system
• Holds the namespace and the metadata of file blocks
• Controls the read write access to the files
• Manages the block replication
• Single point of failure
[Figure: HDFS with master (name node, secondary name node) and slaves (data nodes)]
Secondary name node
• Secondary name node is the HDFS namespace backup
• One for the cluster
• Performs the housekeeping work
• Similar hardware as that of the name node machine
• Not used as a hot standby or a highly available name node backup
• Used for system metadata and namespace recovery
[Figure: HDFS cluster with master (name node, secondary name node) and slaves (data nodes)]
Data Node
•Data nodes do the heavy lifting in the cluster
•Stores data
•Aids in data processing
•Serves read write requests from clients
•Stores and retrieves data blocks
•Performs replication tasks upon requests by the name node
•Reports block status of system to name node
HDFS client
•HDFS clients can be many
•Act as an interface between the end user and the
Hadoop cluster
•Help to communicate to the name node and data nodes
•Help to submit job
•Submit a read-write request to a file
•Interface with the name node
HDFS Namespace
•Hierarchy of files and directories
•Represented by name node data structures called inodes
•Inodes record the attributes of a file:
Permission
Access time
Namespace
Disc space quota
•Metadata file maintains file attributes such as access time and replication factor
•Stored persistently on a local disc and is called fsImage
• Edit log file records every change that occurs to file system metadata
• Metadata saved in RAM for faster access
• Edit logs are merged with metadata periodically
• This operation of merging is known as checkpointing
• After each checkpoint operation:
Edit logs are cleared
A new entry is added
• Merging fsImage with edit logs is done in secondary name node
• fsImage file not updated for every write operation
• fsImage is loaded into RAM at every name node startup
• Every 1 hour, contents of RAM are flushed out
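The merge of fsImage with the edit logs can be sketched with plain dictionaries. This is a toy model under stated assumptions: the real fsImage and edit log are binary files with many more operation types, and the names `checkpoint`, `create` and `delete` are chosen for the example.

```python
def checkpoint(fsimage, edit_log):
    """Apply every logged change to a copy of fsimage and return it with an empty log."""
    merged = dict(fsimage)
    for operation, path, attributes in edit_log:
        if operation == "create":
            merged[path] = attributes
        elif operation == "delete":
            merged.pop(path, None)
    # After a checkpoint the edit log starts again from empty.
    return merged, []

# Metadata as of the last checkpoint.
fsimage = {"/data/a.txt": {"replication": 3}}

# Changes recorded since then.
edit_log = [
    ("create", "/data/b.txt", {"replication": 3}),
    ("delete", "/data/a.txt", None),
]

new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
```

Because each write only appends to the edit log, the fsImage file does not have to be rewritten on every operation; it is brought up to date once per checkpoint.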
Checkpointing Process
• Happens in the secondary name node
• Copy of fsimage is kept in the RAM
• HDFS file system changes are captured in the edit logs
• 'fsimage' loaded as metadata is optimized for read operations and fast searching
• The data corresponding to the edits is captured in edit logs
• Edit logs and 'fsimage' need to be merged periodically
• New copy of the 'fsimage' contents is reordered into the main memory
HDFS Dataflow: File Read Operation
Understanding the steps involved in reading a file from HDFS, an anatomy of a file read operation
[Figure: client node (client JVM with HDFS client, Distributed File System, FS_Data_Input_Stream and DFS_Input_Stream), name node and data nodes. Steps 1 to 7, including the metadata request to get block locations, metadata flow, read and data flow]
HDFS Dataflow Anatomy
Steps involved in reading a file from HDFS
1.The client opens the file to be read by calling Open on the Distributed File System object
2.The object connects to the name node using RPC to get the metadata information
3.For each block, name node returns the addresses of the data nodes having a copy of that block
4.Distributed File System returns an FS_Data_Input_Stream object, which takes care of data node and name node interactions
5.Client calls the Read operation on the stream to connect to the first data node for the first block in the file
6.The data is streamed from the data node back to the client, which calls Read repeatedly until it completes the reading of the file
7.When the client has finished reading, it calls the Close operation on FS_Data_Input_Stream
HDFS Dataflow: Anatomy of a File Write Operation
Understanding the steps involved in writing a file to HDFS, an anatomy of a file write operation
[Figure: client node (client JVM with HDFS client, Distributed File System, FS_Data_Output_Stream, DFS_Output_Stream, Data Queue, Ack Queue and Data Streamer), name node and a pipeline of data nodes. Steps 1 to 12, including create, writing packets, sending acknowledgement packets and complete]
1. The client calls the Create API on the distributed file system object to create a file
2. Object connects to the name node using an RPC call. Creates a new file in the file system's namespace with no blocks associated
3. Client calls the Write API on the data stream
4. DFS_Output_Stream object splits the data into packets and writes them into the internal Data Queue
5. Data Streamer asks the name node for allocation of new blocks by picking the desirable data nodes to store the replicas
6. The list of 3 data nodes forms a pipeline
7. Data Streamer pours the packet into the first data node in the pipeline
8. Each data node stores the packet and forwards it to the next data node in the pipeline
9. DFS_Output_Stream keeps the Ack Queue to store packets that are waiting to be acknowledged by the data nodes
10. Data nodes send acknowledgement packets back through the pipeline
11. When the client finishes writing data, it calls the Close API on the data stream
Architecture of MapReduce in Hadoop 1.0
[Figure: Hadoop 1.X stack. Pig, Hive and Java on top of MapReduce (resource management + job processing), which sits on top of HDFS (storage)]
Jobs submitted to a Hadoop 1.0 cluster get converted to MapReduce jobs
Hadoop Architecture
One Resource Manager (RM) per cluster
The ResourceManager is the rack-aware master node in YARN
Works like an optimised JobTracker (JT)
In YARN, the work of the JT is split into two components within the RM:
Scheduler
Applications Manager (AM)
The Scheduler component of the YARN ResourceManager
allocates resources to running applications.
ResourceManager is the master that arbitrates all the available
cluster resources and thus helps manage the distributed
applications running on the YARN system.
It works together with the per-node NodeManagers and the
per-application ApplicationMaster.
[Figure: Resource Manager with its two components, the Scheduler and the Applications Manager (AM)]
Applications Manager
• Job queue
• Resource list
• Job Scheduling
• Resource allocation
Each time a new job is submitted by a client, it first has to
pass through the application manager
Maintains log of finished jobs
Validates job application requests and rejects those that
violate specifications.
Eliminates duplicate job applications.
Scalability
HADOOP 1.0
•Maximum cluster size: 4,000 nodes
•Maximum number of concurrent tasks (1000+ mappers and reducers running in parallel): 40,000
•JobTracker bottleneck: gets choked up when there’s a lot of traffic (no room for an additional JobTracker)
HADOOP 2.0
•6,000-10,000 machine clusters
•100,000+ concurrent tasks and 10,000 concurrent jobs (1 job = 1000+ tasks)
•Instead of JobTracker, it has a backup Resource Manager. It allows load distribution within the tracker.
Hadoop 1.0 Vs Hadoop 2.0
Multitenancy
HADOOP 1.0
•No support for non-map/reduce jobs
•Designed for batch processing workloads
•Iterative jobs (e.g. for Machine Learning) not supported
•Can’t accommodate third-party frameworks
•Only MapReduce app can be
HADOOP 2.0
•YARN supports both batch processing and non-batch oriented jobs.
•Supports TEZ, which is a parallel processing engine that supports interactive and iterative jobs useful for Machine Learning algorithms
Availability
HADOOP 1.0
•Single point of failure, i.e. NameNode
•When NameNode crashes, cluster goes down
•Jobs need to be re-submitted by users
•The cluster is not highly available
HADOOP 2.0
•Active/Standby NameNode which works in Hot Standby Mode, i.e. the standby NameNode kicks in while the cluster is still running.
•If both Primary and Hot Standby NameNodes go down (which is rare), you can resort to the Secondary NameNode.
•The chance of both NameNodes crashing simultaneously is low.
HADOOP 1.0
•JobTracker: Gets choked up from traffic. Responsible for scheduling and centralized resource allocation in Master mode.
•TaskTracker: does the heavy lifting in the DataNodes
HADOOP 2.0
•Resource Manager is like the JobTracker. Consists of a) a Scheduler that schedules activities and b) an Application Manager (not Master) for resource allocation and monitoring.
•Application Master: equivalent of TaskTracker in MR v1. Responsible for task execution and status updates.
Hadoop 3.x
•Apache Hadoop 3 is around the corner, with members of the Hadoop community at the Apache Software Foundation still testing it.
•Apache Hadoop 3.0 will bring in thousands of new bug fixes, features and enhancements over Hadoop 2.0.
•The major release of Hadoop 3.x is anticipated to be rolled out sometime in mid-2017.
Why hadoop 3.x?
•With Java 7 attaining end of life in 2015, there was a need to
revise the minimum runtime version to Java 8 with a new
Hadoop release so that the new release is supported by Oracle
with security fixes and also will allow hadoop to upgrade its
dependencies to modern versions.
Overview of Hadoop 3.0
• With Hadoop 2.0, shell scripts were difficult to understand: Hadoop developers had to read almost all the shell scripts to find the correct environment variable for an option and how to set it, whether it was java.library.path, the Java classpath or GC options.
• With support for only 2 NameNodes, Hadoop 2 did not provide
maximum level of fault tolerance but with the release of Hadoop
3.x there will be additional fault tolerance as it offers multiple
NameNodes.
• Replication is a costly affair in Hadoop 2 as it follows a 3x
replication scheme leading to 200% additional storage space
and resource overhead. Hadoop 3.0 will incorporate Erasure
Coding in place of replication consuming comparatively less
storage space whilst providing same level of fault tolerance.
What’s New in Hadoop 3.0?
•Minimum Runtime Version for Hadoop 3.0 is JDK 8
•Support for Erasure Coding in HDFS
•Hadoop Shell Script Rewrite
•MapReduce Task Level Native Optimization
•Support for Multiple NameNodes to maximize Fault Tolerance
• Introducing a More Powerful YARN in Hadoop 3.0
•Change in Default Ports for Various Services and Addition of New
Default Ports
Hadoop 2.x vs. Hadoop 3.x
Feature: Minimum Required Java Version
Hadoop 2.x: JDK 6 and above.
Hadoop 3.x: JDK 8 is the minimum runtime version of Java required to run Hadoop 3.x, as many dependency library files have been used from JDK 8.
Feature: Fault Tolerance
Hadoop 2.x: Fault Tolerance is handled through replication, leading to storage and network bandwidth overhead.
Hadoop 3.x: Support for Erasure Coding in HDFS improves fault tolerance
Hadoop 2.0 Vs Hadoop 3.0
Feature: Storage Scheme
Hadoop 2.x: Follows a 3x Replication Scheme for data recovery, leading to 200% storage overhead. For instance, if there are 8 data blocks then a total of 24 blocks will occupy the storage space because of the 3x replication scheme.
Hadoop 3.x: Storage overhead in Hadoop 3.0 is reduced to 50% with support for Erasure Coding. In this case, if there are 8 data blocks then a total of only 12 blocks will occupy the storage space.
Feature: Change in Port Numbers
Hadoop 2.x:
Hadoop HDFS NameNode - 8020
Hadoop HDFS DataNode - 50010
Secondary NameNode HTTPS - 50091
Hadoop 3.x:
Hadoop HDFS NameNode - 9820
Hadoop HDFS DataNode - 9866
Secondary NameNode HTTPS - 9869
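The storage figures in the comparison can be checked with a short calculation. The sketch assumes an erasure coding layout of 6 data units plus 3 parity units, which is one layout that yields the 50% overhead quoted above; the slide itself only states the percentage. The function names are hypothetical.

```python
def blocks_with_replication(data_blocks, replication_factor=3):
    """Total stored blocks when every data block is kept replication_factor times."""
    return data_blocks * replication_factor

def blocks_with_erasure_coding(data_blocks, data_units=6, parity_units=3):
    """Total stored blocks when parity is added at parity_units per data_units.

    Assumption: a 6 data + 3 parity layout, i.e. 50% extra storage.
    """
    return data_blocks + data_blocks * parity_units // data_units

def overhead_percent(total_blocks, data_blocks):
    """Extra storage as a percentage of the original data."""
    return (total_blocks - data_blocks) * 100 // data_blocks

data_blocks = 8
replicated_total = blocks_with_replication(data_blocks)
erasure_coded_total = blocks_with_erasure_coding(data_blocks)
replication_overhead = overhead_percent(replicated_total, data_blocks)
erasure_coding_overhead = overhead_percent(erasure_coded_total, data_blocks)
```

For 8 data blocks this gives 24 stored blocks under 3x replication and 12 under erasure coding, matching the 200% and 50% overheads in the comparison.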
Feature: YARN Timeline Service
Hadoop 2.x: YARN timeline service introduced in Hadoop 2.0 has some scalability issues.
Hadoop 3.x: YARN Timeline service has been enhanced with ATS v2 which improves the scalability and reliability.
Feature: Intra DataNode Balancing
Hadoop 2.x: HDFS Balancer in Hadoop 2.0 caused skew within a DataNode because of addition or replacement of disks.
Hadoop 3.x: Intra DataNode Balancing has been introduced in Hadoop 3.0 to address the intra-DataNode skews which occur when disks are added or replaced.
Feature: Number of NameNodes
Hadoop 2.x: Hadoop 2.0 introduced a single standby NameNode alongside the active NameNode.
Hadoop 3.x: Hadoop 3.0 supports 2 or more NameNodes.
5) Configuring SSH
$ ssh-keygen -t rsa -P ""
note: When you get the prompt (Enter file in which to save the key (/home/manju/.ssh/id_rsa): ), press the ENTER key on the keyboard
6) Copy id_rsa.pub to authorized keys
$ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
7) Disabling IPv6
To disable IPv6 on your Linux machine, you need to update /etc/sysctl.conf by adding the following lines at the end of the file
Hadoop Installation
$ sudo gedit /etc/sysctl.conf
note: typing the above command in a terminal opens the sysctl.conf file; put the 4 lines below in that file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
8) Download the Hadoop tar.gz file from the link below
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
or use
http://mirror.fibergrid.in/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz
9) Create a folder named apache in your home directory, then copy the newly downloaded Hadoop file from the Downloads folder into the apache folder
10) Extract the Hadoop tar.gz file in the same directory
Note: To extract, right-click the Hadoop tar file and choose the "Extract Here" option. The archive is extracted into the same folder.
Note: The later steps use the path /home/manju/apache/hadoop, so rename the extracted folder (hadoop-2.6.4) to hadoop, or adjust the paths in the later steps.
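The extraction in step 10 can also be done from a terminal. The sketch below assumes GNU tar and an archive with a single top-level folder. The function name unpack_hadoop is made up for illustration. Dropping the top-level folder places the files directly under the target directory.

```shell
# Hypothetical helper: unpack a Hadoop tarball into a target directory,
# dropping the archive's top-level folder (for example hadoop-2.6.4/).
unpack_hadoop() {
    tarball=$1
    target_dir=$2
    mkdir -p "$target_dir"
    tar -xzf "$tarball" -C "$target_dir" --strip-components=1
}

# Usage (not run here):
#   unpack_hadoop "$HOME/apache/hadoop-2.6.4.tar.gz" "$HOME/apache/hadoop"
```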
11) Create new folders inside the Hadoop directory
Create a folder named yarn. Inside yarn create hdfs, and inside hdfs create the namenode and datanode directories.
The folder structure looks like this:
/home/manju/apache/hadoop/yarn/hdfs/namenode
/home/manju/apache/hadoop/yarn/hdfs/datanode
12) Give permissions for the newly created directories
$ chmod 777 -R /home/manju/apache/hadoop/yarn
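The folders from step 11 can also be created from a terminal in one go. This is a minimal sketch; the function name make_hdfs_dirs is hypothetical, and the path in the usage comment is the one used in these slides.

```shell
# Hypothetical helper: create the namenode and datanode directories
# under the given Hadoop directory. mkdir -p also creates the missing
# parent folders (yarn and hdfs) and does not fail if they exist.
make_hdfs_dirs() {
    hadoop_dir=$1
    mkdir -p "$hadoop_dir/yarn/hdfs/namenode" "$hadoop_dir/yarn/hdfs/datanode"
}

# Usage (not run here):
#   make_hdfs_dirs /home/manju/apache/hadoop
```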
13) Update Hadoop configuration files
$ gedit ~/.bashrc
Add the following environment variables at the end of the .bashrc file:
# -- HADOOP ENVIRONMENT VARIABLES START -- #
# Change the JAVA_HOME and HADOOP_HOME paths according to your PC configuration
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export HADOOP_HOME=/home/manju/apache/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
Note: After adding the variables above, reload .bashrc:
$ source ~/.bashrc
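After reloading .bashrc, it is worth confirming that the variables are visible in the current shell. A minimal sketch follows; the function name check_hadoop_env is hypothetical, and the variable list is taken from step 13.

```shell
# Hypothetical helper: print the name of every listed variable that is
# unset or empty. Returns non-zero if at least one is missing.
check_hadoop_env() {
    missing=0
    for name in JAVA_HOME HADOOP_HOME HADOOP_MAPRED_HOME HADOOP_COMMON_HOME \
                HADOOP_HDFS_HOME YARN_HOME; do
        # Indirect lookup: read the value of the variable named in $name
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set"
            missing=1
        fi
    done
    return $missing
}

# Usage (not run here):
#   check_hadoop_env && hadoop version
```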
14) Change the setting in hadoop-env.sh
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open hadoop-env.sh and set the Java home path (here /usr/lib/jvm/java-8-openjdk-amd64):
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
15) Change the setting in the core-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open core-site.xml (for example with gedit) and paste these lines inside the <configuration> tag:
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
Note: fs.default.name is the deprecated name of fs.defaultFS. Hadoop 2.x accepts both.
16) Change the setting in the hdfs-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open hdfs-site.xml (for example with gedit) and paste these lines inside the <configuration> tag:
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/datanode</value>
</property>
Note: Change these paths to match the namenode and datanode directories you created earlier on your PC.
17) Change the setting in the yarn-site.xml file
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open yarn-site.xml (for example with gedit) and paste these lines inside the <configuration> tag:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
18) Change the setting in the mapred-site.xml file
Note: Make a copy of mapred-site.xml.template in the same directory and rename the copy to mapred-site.xml.
Go to the Hadoop installation directory, open the etc directory, then the hadoop folder, then open mapred-site.xml (for example with gedit) and paste these lines inside the <configuration> tag:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
19) Format the NameNode
$ cd ~/apache/hadoop/
Note: This changes to the Hadoop home directory.
manju@ubuntu:~/apache/hadoop$ hdfs namenode -format
Note: This command formats HDFS.
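Before starting the daemons, the values pasted into the XML files in steps 15 to 18 can be read back to catch typos. The sketch below is a simple text scan, not a real XML parser. It assumes each <name> and <value> sits on its own line, as in the snippets above. The function name get_property is hypothetical.

```shell
# Hypothetical helper: print the <value> that follows a given <name>
# in a Hadoop-style configuration file.
# Assumes <name> and <value> are on separate lines.
get_property() {
    conf_file=$1
    key=$2
    awk -v key="$key" '
        index($0, "<name>" key "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")     # strip everything up to the opening tag
            sub(/<\/value>.*/, "")   # strip the closing tag and what follows
            print
            exit
        }
    ' "$conf_file"
}

# Usage (not run here):
#   get_property "$HADOOP_HOME/etc/hadoop/hdfs-site.xml" dfs.replication
```

Once Hadoop is on the PATH, hdfs getconf -confKey dfs.replication reports the value that Hadoop itself resolves.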
20) After the format completes, run these 2 commands to start Hadoop
$ start-dfs.sh
$ start-yarn.sh
Note: If these commands prompt with (yes/no), answer "yes".
21) Finally, check whether Hadoop is working
$ jps
Note: The output lists 6 Java processes: the 5 Hadoop daemons plus Jps itself.
manju@ubuntu:~/apache/hadoop$ jps
2337 NameNode
3094 NodeManager
3127 Jps
2986 ResourceManager
2443 DataNode
2845 SecondaryNameNode
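The jps listing can also be checked automatically. A minimal sketch is given below; the function name missing_daemons is hypothetical, and the expected daemon names are the five shown in the listing above.

```shell
# Hypothetical helper: read jps output on standard input and print the
# name of each expected Hadoop daemon that does not appear in it.
missing_daemons() {
    jps_output=$(cat)
    for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
        # -w matches whole words, so NameNode is not found inside SecondaryNameNode
        if ! printf '%s\n' "$jps_output" | grep -qw "$daemon"; then
            echo "$daemon"
        fi
    done
}

# Usage (not run here):
#   jps | missing_daemons    # prints nothing when all five daemons are up
```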
22) To stop Hadoop, use
$ stop-dfs.sh
$ stop-yarn.sh