Snipe Team
Introduction to BigData
Rise of bytes
1000 B = Kilobyte
1000² B = Megabyte
1000³ B = Gigabyte
1000⁴ B = Terabyte
1000⁵ B = Petabyte
1000⁶ B = Exabyte
1000⁷ B = Zettabyte
1000⁸ B = Yottabyte
Types of Data
Data is classified into 3 types:
•Structured Data
•Unstructured Data
•Semi-structured Data
Structured Data
•Fits into the world of RDBMS
•Data is perfectly aligned in rows and columns
•A tabular format is used for representing data
Example: Data storage in a MySQL database
Unstructured Data
•No definite structure can be assigned to this data
•Cannot tabulate the data
•Cannot put in rows and columns
•Cannot be fit into any schema
Example: Text files, PDF documents, web server logs, photos, voice recordings
Semi-structured Data
•Data that falls between structured and unstructured
•Unstructured data embedded within some structure, tags or schema
Example: XML file
Big Data Sources:
•The New York Stock Exchange generates 4 to 5 TB of data per day
•The Internet stores 18.5 petabytes of data in a day
•Twitter handles 12 TB of tweets everyday
•Facebook :
 1.5 billion active users monthly
 300 PB of user data
 10 billion messages per day
Big Data – The buzz word
•80 to 90% of data is unstructured and cannot be fitted into RDBMS-based systems.
•Big Data is Efficient, Economical and Quicker
Characteristics of BigData:
(3 V’s is now 6 V’s)
6V’s
Volume
Velocity
Variety
Veracity
Value
Variability
Volume: Sheer bulk of data being generated.
Velocity: Rate at which data is generated
Ex: 5 to 10 TB of data is uploaded on YouTube every 10 mins.
Variety: 80 to 90% of Big Data is unstructured and semi-structured
Ex: Text data, Voice data, Sensor data etc…
Veracity: Uncertainty or correctness of data
Ex: Data collected from sensors
Variability: Inconsistencies in the rate at which data is generated
Value: What is the value proposition?
What business value does it create?
The output should be worth the investments made to analyze data.
Use Cases of Big Data
Financial Services
•Fraud detection
•Personalized banking services
Health Care
•Analysis was previously restricted to the major players
because tools and technologies were expensive
•Since the advent of big data, the technology is no longer very expensive
•Many players in the health care segment are starting to work on big data
Retail Industry
•Oldest consumers of big data
•Were using data warehousing techniques very heavily
•Now slowly shifting to big data technology
Web and Social Media Analytics
•Most recent entrant
•Heavily investing in big data related research for
 Behavioral analytics
 Social analytics
RDBMS and BIG DATA
Benefits of RDBMS:
•Compatibility
• Flexibility
•Simplicity
•Performance
•Robustness
Well known RDBMS
•Oracle
•Microsoft SQL server
•MySQL
•Teradata
•DB2
RDBMS:
•Normalization
 Data consistency
 Eliminates data duplication
•Relational databases have to be incredibly complex
internally
 Example: Simple select statement could have hundreds
of query execution parts
 RDBMS determines the execution plan using cost based
algorithms
Drawbacks of RDBMS:
•New demand is scalability
•New apps being launched: Massive load on storage and
scalability
•Supporting large number of concurrent users
•Dynamic support is needed
•Scaling:
 A new application can go viral overnight, users increase
from zero to millions
 Some users are frequent, others never return
 Seasonal swings can create spikes
 Users need real-time high performance
 Vertical scaling possible, not ready for horizontal scaling
 Single server node
Today’s demand
Increased workload due to flexibility requirement
Database structures need to be altered
Example: A retailer starts selling televisions
•The database schema is defined for televisions
•Refrigerators and music systems are later added to the catalogue
Introduction to Parallel computing:
Divide and conquer: a task is split across multiple computers (Computer 1 through Computer 4) and processed in parallel
• Reduced time through parallel processing
• Faster and quicker results
Challenges of Super Computing:
•General purpose operating system did not exist
•Buyers of supercomputers were locked to vendors for hardware support
•High initial cost of the hardware
•High cost of software maintenance and upgrades, to be handled in-house
•Custom software had to be developed for individual use cases
HADOOP – The rescue
• A general-purpose, operating-system-like framework
• Built-in rich feature set of software tools and components
• Not locked to one vendor; can be installed on any commodity hardware
• Affordable for mid-sized organizations
• Free software (open source) with free upgrades
• Brings distributed computing to a wider audience
Hadoop History:
2000-2002
•Project NUTCH
•Open source scalable and robust internet search engine
•Doug Cutting and Mike Cafarella
2003-2004
•Big Table
•Map Reduce
•Common features with NUTCH
2006
•Doug Cutting joined Yahoo and created Hadoop
•NUTCH + BigTable + Google MapReduce
•Hadoop MapReduce implemented in Java
2008
•Hadoop: Apache project
•Stable version of Hadoop used in Yahoo
Hadoop
Overview of Hadoop Architecture
Basic components of Hadoop 1.0
•Name node
•Secondary name node
• Job tracker
•Task tracker
•Data nodes
Job in Hadoop Ecosystem
•A job is some task submitted by the user to the Hadoop cluster
•The job is in the form of a program or collection of programs (a JAR file)
which needs to be executed
Attributes:
•Programs
•Input data to the program, i.e. a file or a collection of files in a directory
•Output directory where the results of execution are collected in files
Job
•Java MapReduce jobs
•Programs are submitted into the cluster in the form of a JAR file
•The JAR packages all the classes
•Programs need to be executed on a particular set of data, as sketched below
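The sketch below illustrates, using the standard Hadoop Java MapReduce API, how such a job could be packaged and pointed at an input path and an output directory. The class name IdentityJobDriver and the use of the built-in identity Mapper and Reducer are illustrative assumptions, not part of the original slides.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IdentityJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "identity job");
        job.setJarByClass(IdentityJobDriver.class);      // the JAR that packages all the classes
        job.setMapperClass(Mapper.class);                // built-in identity map program
        job.setReducerClass(Reducer.class);              // built-in identity reduce program
        job.setOutputKeyClass(LongWritable.class);       // default TextInputFormat key type
        job.setOutputValueClass(Text.class);             // default TextInputFormat value type
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file or directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory for results
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
    }
}

Packaged into a JAR, such a driver would typically be submitted with a command of the form: hadoop jar identity-job.jar IdentityJobDriver <input path> <output path>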
Apache Hadoop Core Features
Hadoop Distributed File System(HDFS)
•When a file is submitted to the Hadoop cluster:
 The original file is divided into smaller pieces
 The pieces reside on multiple data nodes
Parallel Processing Framework
•Robust
•Known as a MapReduce Framework
•When you submit the job into the Hadoop cluster:
Program executes on a piece of data
Runs on multiple machines
Master Slave Architecture
•Example: in a cluster of 1000 nodes, three machines act as masters: the name node, the secondary name node, and the job tracker
•The remaining 997 machines work in slave mode and act as data nodes
•The master machines (name node, secondary name node, job tracker) need not have very high hard drive storage space
•Data nodes, however, need very high hard drive storage space
•The data nodes take the load of all the data
•Big data storage makes up the bulk of data operations
Master Slave Architecture: Host OS
•Hadoop as a piece of software framework is installed on native operating
system
•Installed on all these machines along with the data nodes
•The differentiating factor is the software configuration after installing
Hadoop
•Machines perform responsibilities associated with the name node,
secondary name node and job tracker
Hadoop Cluster Setup
•Machines within a rack communicate with the help of a switch at a speed of around 10 gigabits per second
•Multiple racks communicate with the help of a multi layer switch or
uplink switch which also acts as a router
•Data transfer speed between machines within the same rack is
higher than the data transfer speed between the machines across
different racks
General Specifications of Hadoop Cluster
Built-in using Commodity Hardware
•Makes the hardware easy to procure and maintain
•Reduces dependency on just one vendor
Processor Built
•Most of the data nodes have two hex-core or two octa-core processors
•i.e. 2 CPUs, each with 6 or 8 cores
•Clock speed lies anywhere between 2.4 and 3.5 GHz
Storage
•Amount of RAM varies according to the organizational needs
•Name node and job tracker would have higher RAM
•Most data nodes have between 50 and 500 GB of high-speed RAM
Thumb Rule
•To decide how much hard drive storage is needed for each data node:
•Every single core of CPU requires at least 2 terabytes of hard drive
•For example, a data node with 2 octa-core CPUs (16 cores) would need roughly 32 TB of disk
Hadoop Services
1. Once Hadoop is installed, certain services are enabled
2. The processes or services running on a machine determine its role; for example, a machine running the data node service acquires the role of a data node
3. These are the Hadoop services, or Hadoop daemons, running on the cluster
4. A daemon is a set of software processes or a collection of several processes
Hadoop Deployment Modes
Three different modes in which Hadoop can be installed and deployed:
•Standalone Mode
•Pseudo Distributed Mode
•Fully Distributed Mode (Cluster Mode): Hadoop is installed on a cluster of interconnected machines
Hadoop Components: Standalone Mode
•If all the services are sharing a single JVM, this mode is called Standalone Mode
•The services (job tracker, name node, secondary name node, data node) are the software services which actually run as part of your Hadoop installation
•This mode is mainly used for testing purposes
•It is the least preferred mode of Hadoop installation and deployment
•If the JVM crashes, all the Hadoop services also crash

Hadoop Components: Pseudo Distributed Mode
•Each of the Hadoop services runs on a separate JVM, but all on a single machine
•The services run as part of the Hadoop installation
•Crashing of one JVM does not impact the rest of the Hadoop cluster
•This mode is widely used for learning and development purposes, but not for deployment

Pseudo Distributed Mode vs. Standalone Mode
•Similarity: both run on a single machine
•Difference: in Standalone Mode all services share one JVM, so when the JVM crashes all the Hadoop services also crash; in Pseudo Distributed Mode each service has its own JVM
Hadoop Components: Fully Distributed Mode
• Used in real production environments
• If the job tracker is configured on a machine, that dedicated machine runs only the job tracker service of the Hadoop installation
• Similarly, a dedicated machine works as the name node when that particular hardware runs only the name node services
Fully Distributed (Real) vs. Pseudo Distributed Mode
• In Fully Distributed Mode, all the Hadoop services are well interconnected, but each runs in a separate JVM on a separate machine
• In Pseudo Distributed Mode, all the Hadoop services are on different JVMs, but on a single machine
Functionalities of Hadoop Components
•Name Node
All the file system information is available at the name node
It is a centralised file namespace server, or file system server
• Secondary Name Node
Helps the name node to back up the data present in the name node server periodically
In the event of a name node failure, the secondary name node
will be used to recover and restore the name node
Hot Standby Name Node
•Hadoop 1.0 did not have the provision of a hot standby name node
•When the name node is down, the entire cluster goes down in the case of Hadoop 1.0; this is not the case with Hadoop 2.0
•Hot standby means the standby name node works like a UPS: as soon as the power goes down, the UPS starts backing up, providing an uninterrupted power supply
Name node recovery in Hadoop 1.0:
•When the cluster is completely down, bring up the name node
•Copy all the backup files from the secondary name node
•Restore name node operations
In Hadoop 2.0, the hot standby name node takes over instead.
Data Node Functionality
• Job is a collection of programs or a single program which is
going to be operating on a piece of data
•On each of the data nodes, a software service called the task
tracker runs continuously
•The data nodes store the big data on which submitted jobs operate
•The task tracker executes the programs on the data local to its data node
•The decision of which program will be executed by which data node is taken by the job tracker
Job Submission and Execution in a Hadoop Cluster
How is a job submitted into the Hadoop cluster?
How exactly would the job get executed?
Imagine you are working as a data engineer or a data scientist in
your team and usually work on a desktop or a laptop
•Hadoop is installed in Pseudo Distributed Mode for all testing and
development purposes
•The Java MapReduce programs are compiled and packaged into JAR files
•These JAR files are submitted into the cluster as a job through the job client (gateway machine), which communicates with the name node and the job tracker
Job Client/Gateway Machine
•This job client is not exactly a part of the cluster
•Hadoop services are not running on it
•Configured to communicate with the name node and the job
tracker
•Job configuration details(.jar)
 Input file path
 Output file path
Name Node
•Job is picked up by a name node
•Provides information about:
 The blocks corresponding to the input files
 The data nodes on which those blocks are residing
Job Tracker
• Schedule the jobs
• Distribute the job to multiple data nodes on which the input file is
residing
•The result of execution is made available in the output path
•The user can check the status of the job
•The job tracker provides status updates, such as what percentage of the job is currently completed
•This information is available periodically
Basic HDFS
• HDFS stands for Hadoop distributed file system
• File storage component of Hadoop
• Basic architecture of HDFS and Hadoop
• How HDFS stores the file internally
• Failure handling and recovery mechanism
• Rack awareness and block placement strategies
• Role of name node and secondary name node
• When to use HDFS and when not to
Agenda
HDFS: Storage inside the HDFS Cluster
[Figure: data nodes DN 1 through DN 12, each identified by a number and contributing its individual hard drive storage space to the cluster]
[Figure: a 200 MB input file broken into Block 1, Block 2, Block 3, …, Block N]
• HDFS breaks the user input file (200 MB) into smaller chunks
• Block size is configured by the administrator
• Default split size is 64 MB
• Split size can be configured depending upon the requirement
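As a hedged illustration of that configurability, the Java sketch below requests a larger block size when writing a file through the HDFS client API. The HDFS URI, the file path and the 128 MB / 256 MB values are assumptions for the example; dfs.blocksize is the Hadoop 2.x name of the property (older releases used dfs.block.size), and an administrator would normally set it in hdfs-site.xml instead.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Client-side default block size for files created with this configuration.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);   // 128 MB instead of the 64 MB default

        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Per-file override: overwrite flag, buffer size, replication factor, block size.
        FSDataOutputStream out = fs.create(
                new Path("/data/sample.txt"), true, 4096, (short) 3, 256L * 1024 * 1024);
        out.writeUTF("each block of this file may hold up to 256 MB");
        out.close();
        fs.close();
    }
}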
File Storage in HDFS
•Client machine tries to communicate with the name node
•Name node gives out the information about the default split size
•Client machine gets an idea of how big each input split will be
•Splitting of the files actually happens in the client machine
•The name node gives out information about:
 The hostnames or IP addresses of the data nodes
 The free space available to actually store the data
•The client machine then writes the blocks directly onto the data nodes
•The client machine (gateway machine) performs this write bypassing the name node
•The decision of which block resides on which data node is not made randomly; it is governed by a specific set of rules
Design and Architecture Overview
•Each data node sends a heartbeat signal to the name node once every 3 seconds to indicate that it is up and running
[Figure: the HDFS client communicates with the name node; the secondary name node keeps a namespace backup; data nodes serve data and write to their local disks]
Overview of HDFS
• If a data node fails to send the heartbeat signal once every 3 seconds:
  The name node assumes that particular data node is dead
  It takes action to replicate the data that was on it
•Each data node also sends status information, a block status report, to the name node once every 6 hours
•This interval can be configured by the Hadoop administrator
•The report gives complete, detailed information about which blocks exist on that particular data node (see the configuration sketch below)
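A minimal sketch of how those two intervals could be expressed programmatically is shown below. The property names dfs.heartbeat.interval (seconds) and dfs.blockreport.intervalMsec (milliseconds) are standard HDFS settings, but setting them from Java code is only for illustration; an administrator would normally place them in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;

public class IntervalTuning {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Heartbeat from each data node to the name node (seconds; default is 3).
        conf.setLong("dfs.heartbeat.interval", 3L);
        // Full block status report from each data node (milliseconds; 6 hours by default).
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);

        System.out.println("heartbeat interval  = " + conf.get("dfs.heartbeat.interval") + " s");
        System.out.println("block report period = " + conf.get("dfs.blockreport.intervalMsec") + " ms");
    }
}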
Rack View of Hadoop Cluster
•In a production environment, the Hadoop cluster is deployed across multiple racks
•The name node, the secondary name node, and the job tracker are never placed in a single rack
•Otherwise, in the event of failure of that rack, the entire cluster would be down
Failure of a Data Node
•Example: a 200 MB input file is split into blocks N1 (64 MB), N2 (64 MB), N3 (64 MB) and N4 (8 MB)
•The blocks are distributed across the data nodes DN 1 through DN 12; the figure illustrates what happens when one of those data nodes fails
Replication of Data Blocks
• To avoid loss of data, copies of each data block are stored on multiple data nodes
• The default replication factor is 3
• i.e. 3 copies of the same data block on 3 different data nodes
• The replication factor can be configured by the administrator (see the sketch below)
• It should generally not be greater than 3, to avoid consuming a lot of hard drive space
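The sketch below shows, through the Java HDFS client API, how the replication factor could be inspected and changed for a single file; the HDFS URI and file path are hypothetical examples, and dfs.replication is the standard property behind the default of 3.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");             // default replication factor for new files
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);

        Path file = new Path("/data/sample.txt");     // hypothetical existing file
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication: " + status.getReplication());

        // Ask HDFS to keep only 2 copies of this particular file.
        fs.setReplication(file, (short) 2);
        fs.close();
    }
}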
File Storage in HDFS: Worked Example
•Block size = 64 MB, input file = 300 MB, replication factor (RF) = 3, cluster of 5 nodes
•Total blocks = 5: four full 64 MB blocks plus a fifth block of only 44 MB
•(5 × 64 MB = 320 MB, so the last block is not filled completely)
•With RF = 3, the 5 blocks produce 15 block replicas spread across the 5 nodes
Block Placement Strategy / Replica Placement
• Two racks:
  Rack 1 on the left-hand side
  Rack 2 on the right-hand side
• The first replica of a block is placed on one of the data nodes in rack 1 (the left-hand side)
• The two other replicas of the same block are placed on different data nodes, but within the same rack as each other
Data Replication on Failure
•System-wide replication factor = 2 in this example
•In the event of a data node failure:
  The failed data node no longer counts towards the replication count
  The replication factor for block N2 is reduced to 1
•HDFS therefore replicates block N2 onto some other data node
•Example: N2 is replicated onto data node 12
What if the failed data node comes back?
•Suppose data node 10 was only temporarily down and comes back after 2 minutes
•By then, the name node / HDFS has already replicated block N2 onto some other data node
•HDFS then deletes one of the extra copies of N2; the deletion can happen on any of the nodes holding a copy
When to / when not to use HDFS?
Use HDFS for:
•Storing large files, on the order of gigabytes, terabytes and petabytes
•Files whose size is greater than the input split size
Do not use HDFS for:
•Storing a large number of small files
•Workloads that cannot tolerate high I/O latency when data is written to or read from disk
•Files whose size is smaller than the input split size
In HDFS:
 HDFS follows a WORM (write once, read many times) pattern
 Files cannot be edited or changed in place
 To modify a file, it must be deleted and retrieved back into the local file system,
 edited, and then put back onto the HDFS data nodes
Architectural Overview of Hadoop 1.0
Master node (Name node):
• One name node per cluster
• Manages the entire file system
• Holds the namespace of the metadata of file blocks
• Controls read/write access to the files
• Manages the block replication
• Single point of failure
Secondary Name Node:
• The secondary name node is the HDFS namespace backup
• One per cluster
• Performs the housekeeping work
• Runs on similar hardware as the name node machine
• Not used as a hot standby or a highly available name node backup
• Used for system metadata and namespace recovery
Data Nodes:
•The heavy-lifting nodes in the cluster
•Store the data
•Aid in data processing
•Serve read/write requests from clients
•Store and retrieve data blocks
•Perform replication tasks upon request by the name node
•Report the block status of the system to the name node
HDFS Client:
•There can be many HDFS clients
•They act as an interface between the end user and the Hadoop cluster
•They communicate with the name node and the data nodes
•They help to submit jobs
•They submit read/write requests for files
HDFS Namespace
•A hierarchy of files and directories
•Represented by name node data structures called inodes
•Inodes record the attributes of a file:
  Permissions
  Access time
  Namespace and disk space quotas
•A metadata file maintains file attributes such as access time and replication factor
•It is stored persistently on a local disk and is called the fsImage
• The edit log file records every change that occurs to the file system metadata
• Metadata is kept in RAM for faster access
• Edit logs are merged with the metadata periodically
• This merging operation is known as checkpointing
• After each checkpoint operation:
   Edit logs are cleared
   A new entry is added
• Merging the fsImage with the edit logs is done on the secondary name node
• The fsImage file is not updated for every write operation
• The fsImage is loaded into RAM at every name node startup
• Every 1 hour, the contents of RAM are flushed out (checkpointed)
Checkpointing Process
• Happens on the secondary name node
• A copy of the fsimage is kept in RAM
• HDFS file system changes are captured in the edit logs
• The 'fsimage' loaded as metadata is optimized for read operations and fast searching
• The same data corresponding to the edits is captured in the edit logs
• Edit logs and the 'fsimage' need to be merged periodically
• A new copy of the merged 'fsimage' contents is loaded back into main memory
HDFS Dataflow: File Read Operation
[Figure: the HDFS client on the client node uses the Distributed File System and FS_Data_Input_Stream/DFS_Input_Stream objects to request block locations (metadata) from the name node and then read data directly from the data nodes]
Understanding the steps involved in reading a file from HDFS: the anatomy of a file read operation
HDFS Dataflow Anatomy
Steps involved in reading a file from HDFS:
1. The client opens the file to be read by calling the Open operation on the Distributed File System object
2. The object connects to the name node using RPC to get the metadata information
3. For each block, the name node returns the addresses of the data nodes having a copy of that block
4. The Distributed File System returns an FS_Data_Input_Stream object which takes care of the data node and name node interactions
5. The client calls the Read operation on the stream to connect to the first data node for the first block in the file
6. The data is streamed from the data node back to the client, which calls Read repeatedly until it completes reading the file
7. When the client has finished reading, it calls the Close operation on the FS_Data_Input_Stream (a minimal client-side sketch of these steps follows)
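The minimal client-side sketch below maps onto the numbered steps above; under the hood, FileSystem.get returns the Distributed File System object and open() returns the FS_Data_Input_Stream. The HDFS URI and file path are illustrative assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Step 1: obtain the Distributed File System object.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Steps 2-4: open() contacts the name node over RPC for block locations
        // and returns a stream that manages data node and name node interactions.
        FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
        try {
            // Steps 5-6: read repeatedly; the data is streamed from the data nodes.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 7: close the stream when reading is finished.
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}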
HDFS Dataflow: Anatomy of a File Write Operation
[Figure: the HDFS client on the client node uses the Distributed File System and FS_Data_Output_Stream/DFS_Output_Stream objects; packets from the Data Queue are written through a pipeline of data nodes, acknowledgement packets travel back through the Ack Queue, and the name node is contacted on Create and Complete]
Understanding the steps involved in writing a file to HDFS: the anatomy of a file write operation
1. The client calls the Create API on the Distributed File System object to create a file
2. The object connects to the name node using an RPC call and creates a new file in the file system's namespace, with no blocks associated
3. The client calls a Write API on the data stream
4. The DFS_Output_Stream object splits the data into packets and writes them into the internal Data Queue
5. The Data Streamer asks the name node to allocate new blocks by picking the desirable data nodes to store the replicas
6. The list of 3 chosen data nodes forms a pipeline
7. The Data Streamer pours each packet into the first data node in the pipeline
8. Each data node in the pipeline forwards the packet to the next data node in the pipeline
9. DFS_Output_Stream keeps an Ack Queue to store packets that are waiting to be acknowledged by the data nodes
10. The data nodes send acknowledgement packets back; acknowledged packets are removed from the Ack Queue
11. When the client finishes writing data, it calls the Close API on the data stream (a minimal client-side sketch of these steps follows)
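The corresponding client-side sketch for the write path is shown below; the packet queues, the data node pipeline and the acknowledgements of steps 4-10 all happen behind the create/write/close calls. The HDFS URI and file path are illustrative assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Step 1: obtain the Distributed File System object.
        FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf);
        // Step 2: create() asks the name node over RPC to add a new file
        // with no blocks associated and returns an output stream.
        FSDataOutputStream out = fs.create(new Path("/data/output.txt"));
        // Steps 3-10: the data is split into packets, queued, pushed through
        // the data node pipeline and acknowledged behind this simple call.
        out.writeUTF("hello HDFS write pipeline");
        // Step 11: close the stream; the file is marked complete at the name node.
        out.close();
        fs.close();
    }
}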
Architecture of MapReduce in Hadoop 1.0
Hadoop 1.x stack:
•Pig, Hive and Java applications run on top of MapReduce
•MapReduce handles both resource management and job processing
•HDFS handles storage
•Jobs submitted to a Hadoop 1.0 cluster get converted to MapReduce jobs
Hadoop Architecture
Hadoop 2.0 YARN:
•YARN provides the Resource Manager and Job Scheduler
•Third-party frameworks plug into YARN; this is critical for Machine Learning algorithms and allows Spark to plug into Hadoop
•Primary and secondary NameNodes, with a hot standby / highly available NameNode
Hadoop 2.0 YARN offers several advantages over Hadoop 1.0; these are summarized in the Hadoop 1.0 vs Hadoop 2.0 comparison below.
Hadoop 2.x Core Components
•HDFS (storage): Name Node and Secondary Name Node on the master, Data Node on the slaves
•YARN (processing): Resource Manager on the master, Node Manager on the slaves
[Figure: clients submit to the Resource Manager (master), which contains the Scheduler and the Applications Manager (AM); each slave node runs a Node Manager and a Data Node and hosts Containers and per-application App Masters]
 One Resource Manager (RM) per cluster
 The ResourceManager is the rack-aware master node in YARN
 Works like an optimised JobTracker (JT)
 In YARN, JT is split into two daemons with the RM
 Scheduler
 Applications Manager (AM)
 The Scheduler component of the YARN ResourceManager
allocates resources to running applications.
 ResourceManager is the master that arbitrates all the available
cluster resources and thus helps manage the distributed
applications running on the YARN system.
 It works together with the per-node NodeManagers and the
per-application ApplicationMaster.
Application Manager
• Job queue
• Resource list
• Job scheduling
• Resource allocation
• Each time a new job is submitted by a client, it first has to pass through the Application Manager
• Maintains a log of finished jobs
• Validates job application requests and rejects those that violate specifications
• Eliminates duplicate job applications
A small client-side sketch of talking to the Resource Manager follows.
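As a small, hedged illustration of a client talking to the Resource Manager, the sketch below uses the public YarnClient API to list the applications the Resource Manager knows about. It assumes a reachable cluster whose addresses are configured in yarn-site.xml on the classpath; it is not part of the original slides.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListYarnApplications {
    public static void main(String[] args) throws Exception {
        Configuration conf = new YarnConfiguration();     // picks up yarn-site.xml from the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Ask the Resource Manager for the applications it is tracking.
        List<ApplicationReport> apps = yarnClient.getApplications();
        for (ApplicationReport app : apps) {
            System.out.println(app.getApplicationId() + "  " + app.getName()
                    + "  state=" + app.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}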
Hadoop 1.0 vs Hadoop 2.0: Scalability
Hadoop 1.0:
 Maximum cluster size: 4,000 nodes
 Maximum number of concurrent tasks (1000+ mappers and reducers running in parallel): 40,000
 JobTracker bottleneck: gets choked up when there's a lot of traffic (no room for an additional JobTracker)
Hadoop 2.0:
 6,000-10,000 machine clusters
 100,000+ concurrent tasks and 10,000 concurrent jobs (1 job = 1000+ tasks)
 Instead of the JobTracker, it has a backup Resource Manager, which allows load distribution within the tracker
Hadoop 1.0 vs Hadoop 2.0: Multitenancy
Hadoop 1.0:
 No support for non-MapReduce jobs
 Designed for batch processing workloads
 Iterative jobs (e.g. for Machine Learning) not supported
 Can't accommodate third-party frameworks
 Only MapReduce applications can be run
Hadoop 2.0:
 YARN supports both batch processing and non-batch oriented jobs
 Supports TEZ, a parallel processing engine that supports interactive and iterative jobs, useful for Machine Learning algorithms
Hadoop 1.0 vs Hadoop 2.0: Availability
Hadoop 1.0:
 Single point of failure, i.e. the NameNode
 When the NameNode crashes, the cluster goes down
 Jobs need to be re-submitted by users
 The cluster is not highly available
Hadoop 2.0:
 Active/Standby NameNodes working in Hot Standby Mode, i.e. the standby NameNode kicks in while the cluster is still running
 If both the primary and hot standby NameNodes go down (which is rare), you can resort to the Secondary NameNode
 It is very unlikely that both NameNodes crash simultaneously
Hadoop 1.0 Vs Hadoop 2.0
HADOOP 1.0 HADOOP 2.0
 JobTracker: Gets choked
up from traffic.
Responsible for scheduling
and centralized resource
allocation in Master mode.
 TaskTracker: doing heavy
lifting in the DataNodes
 Resource Manager is like the
JobTracker
 Consists of a) Scheduler
that schedules activities &
and b) an Application
Manager (not Master) for,
resource allocation and
monitoring.
 Application Master:
equivalent of TaskTracker in
MR v1. Responsible for task
execution and updation..
Hadoop 3.x
•Apache Hadoop 3 is around the corner, with members of the Hadoop community at the Apache Software Foundation still testing it.
•Apache Hadoop 3.0 will bring in thousands of new bug fixes, features and enhancements over Hadoop 2.0.
•The major release of Hadoop 3.x is anticipated to be rolled out sometime in mid-2017.
Why Hadoop 3.x?
•With Java 7 reaching end of life in 2015, there was a need to raise the minimum runtime version to Java 8 with a new Hadoop release, so that the release is supported by Oracle with security fixes and so that Hadoop can upgrade its dependencies to modern versions.
• With Hadoop 2.0, the shell scripts were difficult to understand: developers had to read almost all of them to figure out the correct environment variable for an option and how to set it, whether java.library.path, the Java classpath or GC options.
• With support for only 2 NameNodes, Hadoop 2 did not provide the maximum level of fault tolerance; Hadoop 3.x adds fault tolerance by supporting multiple NameNodes.
• Replication is a costly affair in Hadoop 2, as its 3x replication scheme leads to 200% additional storage space and resource overhead. Hadoop 3.0 incorporates Erasure Coding in place of replication, consuming comparatively less storage space while providing the same level of fault tolerance.
What’s New in Hadoop 3.0?
•Minimum Runtime Version for Hadoop 3.0 is JDK 8
•Support for Erasure Coding in HDFS
•Hadoop Shell Script Rewrite
•MapReduce Task Level Native Optimization
•Support for Multiple NameNodes to maximize Fault Tolerance
• Introducing a More Powerful YARN in Hadoop 3.0
•Change in Default Ports for Various Services and Addition of New
Default Ports
Hadoop 2.x vs. Hadoop 3.x
Minimum Required Java Version:
 Hadoop 2.x: JDK 6 and above
 Hadoop 3.x: JDK 8 is the minimum runtime version of Java required to run Hadoop 3.x, as many dependency library files have been used from JDK 8
Fault Tolerance:
 Hadoop 2.x: fault tolerance is handled through replication, leading to storage and network bandwidth overhead
 Hadoop 3.x: support for Erasure Coding in HDFS improves fault tolerance
Storage Scheme:
 Hadoop 2.x: follows a 3x replication scheme for data recovery, leading to 200% storage overhead; for instance, if there are 8 data blocks, a total of 24 blocks occupy storage space because of the 3x replication scheme
 Hadoop 3.x: storage overhead is reduced to 50% with support for Erasure Coding; in this case, if there are 8 data blocks, a total of only 12 blocks occupy storage space
Change in Port Numbers:
 Hadoop 2.x: HDFS NameNode - 8020, HDFS DataNode - 50010, Secondary NameNode HTTP - 50091
 Hadoop 3.x: HDFS NameNode - 9820, HDFS DataNode - 9866, Secondary NameNode HTTP - 9869
YARN Timeline Service:
 Hadoop 2.x: the YARN timeline service introduced in Hadoop 2.0 has some scalability issues
 Hadoop 3.x: the YARN Timeline Service has been enhanced with ATS v2, which improves scalability and reliability
Intra-DataNode Balancing:
 Hadoop 2.x: the HDFS Balancer caused skew within a DataNode when disks were added or replaced
 Hadoop 3.x: Intra-DataNode Balancing has been introduced to address the intra-DataNode skews which occur when disks are added or replaced
Number of NameNodes:
 Hadoop 2.x: Hadoop 2.0 introduced a secondary namenode as standby
 Hadoop 3.x: Hadoop 3.0 supports 2 or more NameNodes
Hadoop Installation:
1) Update Ubuntu
$ sudo apt-get update
2) Download and Install JDK
$ sudo apt-get install default-jdk
Reference link:
https://www.digitalocean.com/community/tutorials/how-to-install-ja
3) Check whether Java is installed or not
$ java -version
4) Install SSH
$ sudo apt-get install openssh-server
5) Configuring SSH
$ ssh-keygen -t rsa -P ""
note: when you get the prompt (Enter file in which to save the key
(/home/manju/.ssh/id_rsa): ), just press the ENTER key on the
keyboard
6) Copy id_rsa.pub to authorized keys
$ cat $HOME/.ssh/id_rsa.pub >>
$HOME/.ssh/authorized_keys
7) Disabling IPv6
To disable IPv6 on your Linux machine, you need to update
/etc/sysctl.conf by adding the following lines at the end of
the file
$ sudo gedit /etc/sysctl.conf
note: typing the above command in a terminal opens the sysctl.conf file;
add the lines below to that file
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
8) Now download the Hadoop tar.gz file from one of the links below
http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-
2.6.4/hadoop-2.6.4.tar.gz
or use
http://mirror.fibergrid.in/apache/hadoop/common/hadoo
p-2.6.4/hadoop-2.6.4.tar.gz
9) Now create an apache folder in your home directory, then go
to the Downloads folder and copy the newly downloaded Hadoop
file into the apache folder
10) Extract the Hadoop tar file in the same directory
note: to extract, select the Hadoop tar file, right-click on it, choose
the "Extract Here" option, and it will be extracted automatically into
the same folder.
11) Then create new directories inside the Hadoop directory:
a yarn directory, an hdfs directory inside yarn, and namenode
and datanode directories inside hdfs.
The folder structure looks like this:
/home/manju/apache/hadoop/yarn/hdfs/namenode
/home/manju/apache/hadoop/yarn/hdfs/datanode
12) Give permissions to the newly created directories
$ chmod 777 -R /home/manju/apache/hadoop/yarn
13) Update Hadoop configuration files
$ sudo gedit .bashrc
Add the following environment variables at the end of the .bashrc file
# -- HADOOP ENVIRONMENT VARIABLES START -- #
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (Change the
path according to your pc configuration)
export HADOOP_HOME=/home/manju/apache/hadoop (Change the
path according to your pc configuration)
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export
HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
# -- HADOOP ENVIRONMENT VARIABLES END -- #
Note: after configuring the above variables, refresh the bashrc:
$ source .bashrc
14) Change the setting in hadoop-env.sh
Go to the Hadoop installation directory, open the etc directory,
then the hadoop folder, open hadoop-env.sh, and set the Java home
path, available here at /usr/lib/jvm/java-8-openjdk-amd64
# The java implementation to use.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
15) Change the setting in the core-site.xml file
Go to the Hadoop installation directory, open the etc directory,
then the hadoop folder, open core-site.xml with the gedit tool,
and paste these lines into the <configuration> tag
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
16) Change the setting in the hdfs-site.xml file
Go to the Hadoop installation directory, open the etc directory,
then the hadoop folder, open hdfs-site.xml with the gedit tool,
and paste these lines into the <configuration> tag
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/manju/apache/hadoop/yarn/hdfs/datanode</value>
</property>
Note: here you have to change the directory paths according to your
PC; these are the namenode and datanode directories we created earlier
17) Change the setting in the yarn-site.xml file
Go to the Hadoop installation directory, open the etc directory,
then the hadoop folder, open yarn-site.xml with the gedit tool,
and paste these lines into the <configuration> tag
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
18) Change the setting in the mapred-site.xml file
note: copy the mapred-site.xml.template file, paste it in the
same directory, and rename the copy to mapred-site.xml
Go to the Hadoop installation directory, open the etc directory,
then the hadoop folder, open mapred-site.xml with the gedit tool,
and paste these lines into the <configuration> tag
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
19) Format namenode
$ cd apache/hadoop/ (press Enter; it takes you to the Hadoop home
directory)
manju@ubuntu:~/apache/hadoop$ hdfs namenode -format (use
this command to format HDFS)
20) After Format completes run these 2 commands to start Hadoop
$ start-dfs.sh
$ start-yarn.sh
note: when you run the above commands, if asked (yes/no), just answer
"yes"
21) Finally, check whether Hadoop is working or not
$ jps
Note: it should show a total of 6 entries (the 5 Hadoop daemons plus Jps) in the terminal
manju@ubuntu:~/apache/hadoop$ jps
2337 NameNode
3094 NodeManager
3127 Jps
2986 ResourceManager
2443 DataNode
2845 SecondaryNameNode
22) To stop Hadoop use
$ stop-dfs.sh
$ stop-yarn.sh
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Probability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter LessonsProbability Grade 10 Third Quarter Lessons
Probability Grade 10 Third Quarter Lessons
 

Hadoop

  • 13. December 1, 2017 13 Drawbacks of RDBMS: •The new demand is scalability •New apps being launched place a massive load on storage and scalability •Supporting a large number of concurrent users •Dynamic support is needed •Scaling:  A new application can go viral overnight, and users increase from zero to millions  Some users are frequent, others never return  Seasonal swings can create spikes  Users need real-time high performance  Vertical scaling is possible, but RDBMS is not ready for horizontal scaling  Single server node Introduction to BigData
  • 14. December 1, 2017 14 Today’s demand Increased workload due to flexibility requirements Database structures need to be altered Example: Started selling televisions •Database schema defined for televisions •Refrigerators and music systems later added to the catalogue Introduction to BigData
  • 15. December 1, 2017 15 Introduction to Parallel Computing: Divide a task and conquer: the task is split across Computer 1, Computer 2, Computer 3 and Computer 4 • Reduced time through parallel processing • Faster and quicker Introduction to BigData
  • 16. December 1, 2017 16 Challenges of Super Computing: •A general-purpose operating system did not exist •Buyers of supercomputers were locked to vendors for hardware support •High initial cost of the hardware •High cost of software maintenance, with upgrades to be taken care of in-house •Custom software had to be developed for individual use cases HADOOP – The rescue • General-purpose, operating-system-like framework • Built-in, rich set of software tools and components • Not locked to one vendor; can be installed on any commodity hardware • Mid-sized organizations can afford it • Free software (open source) with free upgrades • Brings distributed computing to a wider audience Introduction to BigData
  • 17. December 1, 2017 17 Hadoop History: 2000-2002 •Project NUTCH •Open source, scalable and robust internet search engine •Doug Cutting and Mike Cafarella 2003-2004 •BigTable •MapReduce •Common features with NUTCH 2006 •Doug Cutting joined Yahoo and created Hadoop •NUTCH + BigTable + Google MapReduce •Hadoop MapReduce implemented in Java 2008 •Hadoop became an Apache project •Stable version of Hadoop used in Yahoo Introduction to BigData
  • 18. December 1, 2017 18 Hadoop
  • 19. December 1, 2017 19 Overview of Hadoop Architecture Basic components of Hadoop 1.0 •Name node •Secondary name node • Job tracker •Task tracker •Data nodes Job in Hadoop Ecosystem •A job is some task submitted by the user to the Hadoop cluster •The job is in the form of a program or collection of programs (a JAR file) which needs to be executed
  • 20. December 1, 2017 20 Attributes: •Programs •Input data to the program, i.e. a file or a collection of files in a directory •Output directory where the results of execution are collected in files Job •Java MapReduce jobs. •Programs are submitted into the cluster in the form of a JAR file. •The JAR packages all the classes. •Programs need to be executed on a particular set of data (a minimal example follows below). Overview of Hadoop Architecture
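To make the three job attributes (program, input, output) concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API; the class names and command-line paths are illustrative assumptions, not details from the slides.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {

    // Mapper: emits (word, 1) for every word in the input split it is given
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountJob.class);   // the JAR holding these classes is the "program"
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input file or directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory for results
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a JAR together with its mapper and reducer classes and given an input path and an output directory, this is exactly the kind of "job" described above; it would typically be launched with the hadoop jar command.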
  • 21. December 1, 2017 21 Overview of Hadoop Architecture Apache Hadoop Core Features Hadoop Distributed File System (HDFS) •When a file is submitted into the Hadoop cluster: It resides on multiple data nodes The original file is divided into smaller pieces Parallel Processing Framework •Robust •Known as the MapReduce framework •When you submit a job into the Hadoop cluster: The program executes on a piece of data It runs on multiple machines
  • 22. Master-Slave Architecture: in a cluster of 1,000 nodes, the name node, secondary name node and job tracker act as masters, while the remaining 997 machines work in slave mode and act as data nodes Overview of Hadoop Architecture
  • 23. The name node, secondary name node and job tracker need not have very high hard drive storage space •Data nodes, in contrast, need to be very high in terms of hard drive storage space •These nodes take the load of all the data •Big data storage makes up the bulk of data operations Overview of Hadoop Architecture
  • 24. December 1, 2017 24 Master Slave Architecture: Host OS •Hadoop, as a software framework, is installed on the native operating system •It is installed on all the machines, including the data nodes •The differentiating factor is the software configuration after installing Hadoop •Dedicated machines perform the responsibilities associated with the name node, secondary name node and job tracker Hadoop Cluster Setup •Machines within a rack communicate with the help of a switch at a speed of around 10 gigabits per second •Multiple racks communicate with the help of a multi-layer switch or uplink switch, which also acts as a router •Data transfer speed between machines within the same rack is higher than the data transfer speed between machines across different racks Overview of Hadoop Architecture
  • 25. General Specifications of Hadoop Cluster December 1, 2017 25 Built using Commodity Hardware •Makes the hardware easy to procure and maintain •Reduces dependency on just one vendor Processor Build •Most of the data nodes have two hex-core or two octa-core processors •2 CPUs, each with up to 8 cores •Clock speed lies anywhere between 2.4 and 3.5 gigahertz Overview of Hadoop Architecture
  • 26. December 1, 2017 26 Storage •The amount of RAM varies according to organizational needs •The name node and job tracker would have higher RAM •Most of the data nodes range between 50 and 500 GB of high-speed RAM Thumb Rule •To decide how much hard drive storage is needed for each data node: every single core of CPU requires at least 2 terabytes of hard drive Overview of Hadoop Architecture
  • 27. December 1, 2017 27 Hadoop Services 1. Once Hadoop is installed, certain services are enabled 2. When the processes or services associated with a role (for example the name node or a data node) are running, the machine acquires that role 3. These are the Hadoop services, or Hadoop daemons, that keep running 4. Each service is a software process or a collection of several processes Overview of Hadoop Architecture
  • 28. Three different modes in which Hadoop can be installed and deployed: Standalone Mode, Pseudo Distributed Mode, and Fully Distributed Mode (Cluster Mode), in which Hadoop is installed on a cluster of interconnected machines Hadoop Deployment Modes
  • 29. December 1, 2017 29 Hadoop Deployment Modes Hadoop Components: Standalone Mode These are the software services which actually run as a part of your Hadoop installation: name node, secondary name node, data node and job tracker If all the services share a single JVM, this mode is called the Standalone Mode If the JVM crashes, all Hadoop services will also crash This mode is mainly used for testing purposes It is the least preferred mode of Hadoop installation and deployment
  • 30. Hadoop Components: Pseudo Distributed Mode Each of the Hadoop services runs on a separate JVM The services run as a part of a single Hadoop installation Pseudo Distributed Mode vs. Standalone Mode Similarity: both run on a single machine Difference: in Standalone Mode, when the JVM crashes, all the Hadoop services also crash; in Pseudo Distributed Mode, the crash of one JVM does not impact the rest of the Hadoop cluster This mode is widely used for learning and development purposes but not for production deployment Hadoop Deployment Modes
  • 31. December 1, 2017 31 Hadoop Deployment Mode Hadoop Components: Fully Distributed Mode • Used in real production environments • If the job tracker is configured on a machine, that dedicated machine runs only the job tracker as part of its Hadoop installation • Likewise, a dedicated machine working as the name node runs only the name node services Real vs. Pseudo Distributed Mode • In Real (Fully) Distributed Mode all the Hadoop services are interconnected but each runs on a separate JVM on a separate machine • In Pseudo Distributed Mode all the Hadoop services are on different JVMs but on a single machine
  • 32. December 1, 2017 32 Functionalities of Hadoop Components •Name Node All the information is available at the name node It is a centralised file namespace server, or file system server • Secondary Name Node Helps the name node by backing up the data present in the name node server periodically In the event of a name node failure, the secondary name node will be used to recover and restore the name node
  • 33. Hadoop 1.0 did not have the provision of a hot standby name node Hot standby means that a standby name node works like a UPS: as soon as the power goes down, the UPS starts backing up, so supply is uninterrupted When the name node is down, the entire cluster goes down in the case of Hadoop 1.0; this is not the case with Hadoop 2.0 Functionalities of hadoop components
  • 34. In Hadoop 2.0: when a cluster is completely down, bring up the name node, copy all the backup files from the secondary name node, and restore name node operations Functionalities of hadoop components
  • 35. December 1, 2017 35 Functionalities of hadoop components Data Node Functionality • A job is a collection of programs, or a single program, which is going to operate on a piece of data •On each of the data nodes, a software service called the task tracker runs continuously •The data node stores the big data on which a submitted job operates •Data may be manipulated before executing on the data node •The decision of which program will be executed by which data node is taken by the job tracker
  • 36. December 1, 2017 36 Job Submission and Execution in a Hadoop cluster How is a job submitted into the Hadoop cluster? How exactly does the job get executed? Imagine you are working as a data engineer or a data scientist in your team and usually work on a desktop or a laptop •Hadoop is installed in Pseudo Distributed Mode for all testing and development purposes •The Java MapReduce files are compiled and packaged into JAR files •These JAR files are submitted into the cluster as a job
  • 37. Job Submission and Execution in the Hadoop cluster Job Client/Gateway Machine •This job client is not exactly a part of the cluster •Hadoop services are not running on it •It is configured to communicate with the name node and the job tracker •It holds the job configuration details (.jar), the input file path and the output file path Job Submission and Execution
  • 38. December 1, 2017 38 Name Node •The job is picked up by the name node •It provides information about: The blocks corresponding to the input files The locations where the work (data) is residing Job Tracker • Schedules the jobs • Distributes the job to the multiple data nodes on which the input file is residing •The result of execution is available in the output path •The user can check the status of the job •The status update can be found using the job tracker: what percentage of the job is currently completed; the information is updated periodically Job Submission and Execution
  • 39. December 1, 2017 39 Basic HDFS • HDFS stands for Hadoop Distributed File System • File storage component of Hadoop • Basic architecture of HDFS and Hadoop • How HDFS stores files internally • Failure handling and recovery mechanism • Rack awareness and block placement strategies • Role of the name node and secondary name node • When to use HDFS and when not to Agenda
  • 41. HDFS: Storage inside the HDFS cluster An input file (200 MB) is broken into Block 1, Block 2, Block 3, ..., Block N • HDFS breaks the user input file (200 MB) into smaller chunks • Block size is configured by the administrator • Default split size is 64 MB • Split size can be configured depending upon the requirement Basic HDFS
  • 42. December 1, 2017 42 File Storage in HDFS •The client machine communicates with the name node •The name node gives out the information about the default split size •The client machine thus knows how big each input split will be •Splitting of the files actually happens in the client machine •The name node also gives out information about:  The hostnames and IP addresses of the data nodes  The free space available to store the data •The client machine then writes blocks directly onto the data nodes •The client machine or gateway machine does this by bypassing the name node •The decision of which block resides on which data node is not made randomly; it is governed by a specific set of rules Basic HDFS
  • 43. Design and Architecture Overview: each data node sends a heartbeat signal to the name node once every 3 seconds to indicate that it is up and running; the HDFS client talks to the name node, the secondary name node keeps a namespace backup, and the data nodes serve data and write to their local disks Overview of HDFS
  • 44. • If a data node fails to send the heartbeat signal once every 3 seconds:  The name node assumes that particular data node is dead  It takes action to replicate that node's data elsewhere Overview of HDFS
  • 45. December 1, 2017 45 •The data node also sends status information to the name node once every 6 hours •This value can be configured to a different number by the Hadoop administrator (a configuration sketch follows below) •It gives the block status report of the data node •That is, complete, detailed information about which blocks exist on that particular data node Overview of HDFS
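As a rough illustration of how these intervals can be tuned, the sketch below sets them programmatically on a Hadoop Configuration object; in practice they would normally be placed in hdfs-site.xml. The property names dfs.heartbeat.interval and dfs.blockreport.intervalMsec are the standard HDFS ones, assumed here to apply to the cluster being described.

import org.apache.hadoop.conf.Configuration;

public class HdfsIntervalConfig {
    public static Configuration tunedConf() {
        Configuration conf = new Configuration();
        // Heartbeat interval in seconds (default 3, as described above)
        conf.setLong("dfs.heartbeat.interval", 3);
        // Block report interval in milliseconds (default 21600000 ms = 6 hours)
        conf.setLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);
        return conf;
    }
}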
  • 46. Rack View of Hadoop Cluster: a Hadoop cluster is deployed in a production environment across multiple racks • The name node, the secondary name node, and the job tracker are never placed in a single rack • Otherwise, in the event of a failure of that rack, the entire cluster would be down Rack view of Hadoop Cluster
  • 47. Failure of a Data Node: the input file (200 MB) is split into blocks N1 (64 MB), N2 (64 MB), N3 (64 MB) and N4 (8 MB), which are distributed across data nodes DN 1 to DN 12 Data node Failure
  • 48. December 1, 2017 48 Replication of Data Blocks • To avoid loss of data, copies of the data blocks on a data node are stored on multiple data nodes • The default replication factor is 3 • 3 copies of the same data block sit on 3 different data nodes • This can be configured by the administrator • The replication factor should not be greater than 3, to avoid consuming a lot of hard drive space Data Block Replication
  • 49. File storage in HDFS, a worked example: file size = 300 MB, block size = 64 MB, replication factor (RF) = 3, cluster = 5 nodes. The file needs 5 blocks in total: 4 full blocks of 64 MB and a 5th block of 44 MB (5 x 64 MB = 320 MB would overshoot the 300 MB file). A small sketch of this arithmetic follows below. File storage in HDFS
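A minimal sketch of the block arithmetic implied by this example; the numbers (300 MB file, 64 MB blocks, replication factor 3) come from the slide, while the class and variable names are purely illustrative.

public class BlockMath {
    public static void main(String[] args) {
        long fileSizeMb = 300;      // input file size from the example
        long blockSizeMb = 64;      // default HDFS block size used in the slides
        int replicationFactor = 3;  // default replication factor

        // Number of blocks is the file size divided by the block size, rounded up
        long fullBlocks = fileSizeMb / blockSizeMb;                 // 4 full 64 MB blocks
        long lastBlockMb = fileSizeMb - fullBlocks * blockSizeMb;   // 44 MB remainder
        long totalBlocks = fullBlocks + (lastBlockMb > 0 ? 1 : 0);  // 5 blocks in total

        // With replication, each block is stored replicationFactor times
        long storedReplicas = totalBlocks * replicationFactor;      // 15 block replicas
        long rawStorageMb = fileSizeMb * replicationFactor;         // 900 MB of raw storage

        System.out.printf("blocks=%d, last block=%d MB, replicas=%d, raw storage=%d MB%n",
                totalBlocks, lastBlockMb, storedReplicas, rawStorageMb);
    }
}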
  • 50. December 1, 2017 50 Block Placement Strategy/Replica Placement • Two racks:  Rack 1 on the left-hand side  Rack 2 on the right-hand side • The first replica of the block is placed on one of the data nodes on the left-hand side, i.e. rack 1 • The two other replicas of the same block are split across different data nodes that share the same (other) rack Block placement Strategy
  • 51. Data Replication on Failure (system-wide replication factor = 2) • In the event of a data node failure:  The failed data node drops out of the replication count  The replication count for block N2 is reduced to 1 • HDFS therefore replicates block N2 onto some other data node • Example: N2 is replicated onto data node 12 Block placement Strategy
  • 52. Data Replication on Failure (system-wide replication factor = 2) What if? • Data node 10 was only temporarily down and comes back up after some time, say 2 minutes • The name node (HDFS) has already replicated block N2 onto some other data node • HDFS then deletes one of the extra copies of N2; the deletion can happen on any of the nodes holding it Block placement Strategy
  • 53. December 1, 2017 53 Basic HDFS When to/not to use HDFS? Use HDFS •For storing large files on the order of gigabytes, terabytes and petabytes •When the input file size is greater than the input split size Do not use HDFS •For storing a large number of small files •When low I/O latency is needed while data is written to or read from disk (HDFS has high I/O latency) •When the input file size is smaller than the input split size
  • 54. December 1, 2017 54 In HDFS  WORM - write once, read many times pattern  Files cannot be edited/changed in place  To modify a file, it is deleted and retrieved back into the local file system  It is edited there and then put back into the HDFS data nodes Basic HDFS
  • 55. Architectural Overview of Hadoop 1.0 Master node (the name node, with the secondary name node; the data nodes are the slaves) • One name node per cluster • Manages the entire file system • Holds the namespace and the metadata of file blocks • Controls read/write access to the files • Manages the block replication • Is a single point of failure Architectural Overview of Hadoop 1.0
  • 56. Secondary name node • The secondary name node holds the HDFS namespace backup • There is one for the cluster • It performs the housekeeping work • It runs on hardware similar to that of the name node machine • It is not used as a hot standby or a highly available name node backup • It is used for system metadata and namespace recovery Architectural Overview of Hadoop 1.0
  • 57. 57 Data Nodes •The heavy-lifting nodes in the cluster, i.e. the data nodes •Store data •Aid in data processing •Serve read/write requests from clients •Store and retrieve data blocks •Perform replication tasks upon request by the name node •Report the block status of the system to the name node HDFS client •HDFS clients can be many •They act as an interface between the end user and the Hadoop cluster •They help to communicate with the name node and data nodes •They help to submit jobs •They submit read/write requests for a file •They interface with the name node Architectural Overview of Hadoop 1.0
  • 58. HDFS Namespace •Hierarchy of files and directories •Represented by name node data structures called inodes •These record the attributes of a file:  Permission  Access time  Namespace  Disk space quota The metadata file maintains file attributes such as access time and replication factor; it is stored persistently on a local disk and is called the fsImage Architectural Overview of Hadoop 1.0
  • 59. • The edit log file records every change that occurs to the file system metadata • Metadata is kept in RAM for faster access • Edit logs are merged with the metadata periodically • This merging operation is known as checkpointing • After each checkpoint operation:  Edit logs are cleared  A new entry is added • Merging the fsImage with the edit logs is done in the secondary name node • The fsImage file is not updated for every write operation • The fsImage is loaded into RAM at every name node startup • Every 1 hour, the contents of RAM are flushed out Architectural Overview of Hadoop 1.0
  • 60. Checkpointing Process • Happens in the secondary name node • A copy of the fsimage is kept in RAM • HDFS file system changes are captured in the edit logs • The 'fsimage' loaded as metadata is optimized for read operations and fast searching • The same data corresponding to the edits is captured in the edit logs • The edit logs and 'fsimage' need to be merged periodically • A new copy of the 'fsimage' contents is then reloaded into main memory Architectural Overview of Hadoop 1.0
  • 61. HDFS Dataflow: File Read Operation. Understanding the steps involved in reading a file from HDFS, an anatomy of a file read operation (the diagram shows the HDFS client on the client node opening a Distributed File System object, the metadata request for block locations going to the name node, and the FS_Data_Input_Stream/DFS_Input_Stream reading the data flow from the data nodes) HDFS Dataflow Anatomy
  • 62. Steps involved in reading a file from HDFS (a client-side sketch of these steps follows below) 1. The client opens the file to be read by calling Open on the Distributed File System object 2. The object connects to the name node using RPC to get the metadata information 3. For each block, the name node returns the addresses of the data nodes having a copy of that block 4. The Distributed File System returns a stream object which takes care of data node and name node interactions 5. The client calls the Read operation on the stream to connect to the first data node for the first block in the file 6. The data is streamed from the data node back to the client, which calls Read repeatedly until it completes reading the file 7. When the client has finished reading, it calls the Close operation on FS_Data_Input_Stream HDFS Dataflow Anatomy
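As a rough illustration of these steps in client code, here is a minimal read sketch against the standard Hadoop FileSystem Java API; the path /user/demo/input.txt and the assumption that fs.defaultFS already points at the cluster are illustrative, not details from the slides.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws IOException {
        // Assumes fs.defaultFS in core-site.xml points at the cluster, e.g. hdfs://localhost:9000
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-4: open() goes through the Distributed File System object, which asks the
        // name node (over RPC) for the data node addresses of each block
        try (FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            byte[] buffer = new byte[4096];
            int bytesRead;
            // Steps 5-6: read() streams the data back from the data nodes, block by block
            while ((bytesRead = in.read(buffer)) != -1) {
                System.out.write(buffer, 0, bytesRead);
            }
        } // Step 7: close() is called on the stream by try-with-resources
        System.out.flush();
        fs.close();
    }
}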
  • 63. HDFS Dataflow: Anatomy of a File Write Operation. Understanding the steps involved in writing a file to HDFS (the diagram shows the HDFS client on the client node calling Create on the Distributed File System object, the name node creating the file entry, the FS_Data_Output_Stream/DFS_Output_Stream with its data queue and ack queue, the data streamer writing packets down the data node pipeline, and the data nodes sending acknowledgement packets back) HDFS Dataflow Anatomy
  • 64. (A client-side sketch of these steps follows below.) 1. The client calls the Create API on the Distributed File System object to create a file 2. The object connects to the name node using an RPC call and creates a new file in the file system namespace, with no blocks associated yet 3. The client calls a Write API on the data 4. The DFS_Output_Stream object splits the data into packets and writes them into the internal data queue 5. It asks the name node to allocate new blocks by picking desirable data nodes to store the replicas 6. The list of 3 data nodes forms a pipeline 7. The data streamer pours each packet into the first data node in the pipeline 8. That data node passes the packet on to the next data node in the pipeline 9. DFSOutputStream keeps the ack queue to store packets that are waiting to be acknowledged by the data nodes 10. The data nodes send acknowledgement packets back 11. When the client finishes writing data, it calls the Close API on the data stream HDFS Dataflow Anatomy
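To ground these write steps, here is a minimal client-side sketch using the same standard FileSystem API; the path /user/demo/output.txt and the sample content are assumptions for the example, not taken from the slides.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Steps 1-2: create() asks the name node (over RPC) to add the file to the
        // namespace; no blocks are associated with it yet
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt"))) {
            // Steps 3-8: write() hands data to the output stream, which packages it into
            // packets that the data streamer pushes down the data node pipeline
            out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        } // Step 11: close() flushes the remaining packets and waits for acknowledgements
        fs.close();
    }
}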
  • 65. 65 Architecture of MapReduce in Hadoop 1.0: Pig, Hive and Java MapReduce sit on top of MapReduce (resource management + job processing), which sits on top of HDFS (storage) in Hadoop 1.x  Jobs submitted to a Hadoop 1.0 cluster get converted to MapReduce jobs Hadoop Architecture
  • 66. Hadoop 2.0 YARN: the Resource Manager and Job Scheduler now live in YARN; third-party frameworks plug into YARN, which is critical for Machine Learning algorithms and is what allows Spark to plug into Hadoop; HDFS gains primary and secondary NameNodes with a hot standby/highly available NameNode Hadoop 2.0 YARN Hadoop Architecture
  • 67. December 1, 2017 67 Hadoop 2.0 YARN advantages over Hadoop 1.0 Hadoop Architecture
  • 68. December 1, 2017 68 Hadoop 2.x Core Components: HDFS (storage) with the Name Node and Secondary Name Node as masters and the Data Node as slave; YARN (processing) with the Resource Manager as master and the Node Manager as slave Hadoop 2.0 Core Components Hadoop Architecture
  • 69. December 1, 2017 69 Resource Manager (master): contains the Scheduler and the Applications Manager (AM); clients submit jobs to it, and it works with the Node Managers (slaves) on the data nodes, each of which hosts containers and an App Master Hadoop Architecture
  • 70. December 1, 2017 70  One Resource Manager (RM) per cluster  The ResourceManager is the rack-aware master node in YARN  It works like an optimised JobTracker (JT)  In YARN, the JT's responsibilities are split between two daemons inside the RM: the Scheduler and the Applications Manager (AM)  The Scheduler component of the YARN ResourceManager allocates resources to running applications.  The ResourceManager is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.  It works together with the per-node NodeManagers and the per-application ApplicationMaster (a small client-side sketch follows below). Hadoop Architecture
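As a rough sketch of how a client interacts with the ResourceManager, the example below uses the YarnClient API to ask the RM for its running NodeManagers; this is an illustrative snippet, not something taken from the slides, and it assumes yarn-site.xml already points at the cluster's ResourceManager.

import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class YarnClusterInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The YarnClient talks to the ResourceManager configured in yarn-site.xml
        YarnClient yarn = YarnClient.createYarnClient();
        yarn.init(conf);
        yarn.start();
        try {
            // Ask the ResourceManager for the NodeManagers it currently knows about
            List<NodeReport> nodes = yarn.getNodeReports(NodeState.RUNNING);
            for (NodeReport node : nodes) {
                System.out.println(node.getNodeId() + " containers=" + node.getNumContainers());
            }
        } finally {
            yarn.stop();
        }
    }
}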
  • 71. December 1, 2017 71 Resource Manager: Scheduler and Applications Manager (AM) • Job queue • Resource list • Job scheduling • Resource allocation Each time a new job is submitted by a client, it first has to pass through the Applications Manager, which:  Maintains a log of finished jobs  Validates job application requests and rejects those that violate specifications  Eliminates duplicate job applications Hadoop Architecture
  • 72. Hadoop 1.0 vs Hadoop 2.0: Scalability. HADOOP 1.0:  Maximum cluster size: 4,000 nodes  Maximum number of concurrent tasks (1,000+ mappers and reducers running in parallel): 40,000  JobTracker bottleneck: it gets choked up when there is a lot of traffic (and there is no room for an additional JobTracker). HADOOP 2.0:  6,000-10,000 machine clusters  100,000+ concurrent tasks and 10,000 concurrent jobs (1 job = 1,000+ tasks)  Instead of the JobTracker it has a Resource Manager, which allows load distribution. Hadoop 1.0 Vs Hadoop 2.0
  • 73. Hadoop 1.0 vs Hadoop 2.0: Multitenancy. HADOOP 1.0:  No support for non-map/reduce jobs  Designed for batch processing workloads  Iterative jobs (e.g. for Machine Learning) are not supported  Can’t accommodate third-party frameworks; only MapReduce applications can be run. HADOOP 2.0:  YARN supports both batch processing and non-batch oriented jobs  Supports TEZ, a parallel processing engine that supports the interactive and iterative jobs useful for Machine Learning algorithms. Hadoop 1.0 Vs Hadoop 2.0
  • 74. Hadoop 1.0 vs Hadoop 2.0: Availability. HADOOP 1.0:  Single point of failure, i.e. the NameNode  When the NameNode crashes, the cluster goes down  Jobs need to be re-submitted by users  The cluster is not highly available. HADOOP 2.0:  An Active/Standby NameNode pair works in hot standby mode, i.e. the standby NameNode kicks in while the cluster is still running  If both the Primary and Hot Standby NameNodes go down (which is rare, as the chance of both crashing simultaneously is small), you can resort to the Secondary NameNode. Hadoop 1.0 Vs Hadoop 2.0
  • 75. Hadoop 1.0 vs Hadoop 2.0.  HADOOP 1.0: JobTracker: gets choked up from traffic; responsible for scheduling and centralized resource allocation in master mode. TaskTracker: does the heavy lifting on the DataNodes.  HADOOP 2.0: The Resource Manager is like the JobTracker; it consists of a) a Scheduler that schedules activities and b) an Application Manager (not Master) for resource allocation and monitoring. Application Master: the equivalent of the TaskTracker in MR v1; responsible for task execution and status updates. Hadoop 1.0 Vs Hadoop 2.0
  • 76. December 1, 2017 76 Hadoop 3.x •Apache Hadoop 3 is around the corner, with members of the Hadoop community at the Apache Software Foundation still testing it. •Apache Hadoop 3.0 will bring in thousands of new bug fixes, features and enhancements over Hadoop 2.0. •The major release of Hadoop 3.x is anticipated to be rolled out sometime in mid-2017. Why Hadoop 3.x? •With Java 7 reaching end of life in 2015, there was a need to raise the minimum runtime version to Java 8 with a new Hadoop release, so that the new release is supported by Oracle with security fixes and so that Hadoop can upgrade its dependencies to modern versions. Overview of Hadoop 3.0
  • 77. December 1, 2017 77 • With Hadoop 2.0, the shell scripts were difficult to understand, as Hadoop developers had to read almost all of them to figure out the correct environment variable for an option and how to set it, whether it was java.library.path, the Java classpath or GC options. • With support for only 2 NameNodes, Hadoop 2 did not provide the maximum level of fault tolerance, but with the release of Hadoop 3.x there will be additional fault tolerance as it offers multiple NameNodes. • Replication is a costly affair in Hadoop 2, as it follows a 3x replication scheme leading to 200% additional storage space and resource overhead. Hadoop 3.0 will incorporate Erasure Coding in place of replication, consuming comparatively less storage space while providing the same level of fault tolerance. Overview of Hadoop 3.0
  • 78. December 1, 2017 78 What’s New in Hadoop 3.0? •Minimum Runtime Version for Hadoop 3.0 is JDK 8 •Support for Erasure Coding in HDFS •Hadoop Shell Script Rewrite •MapReduce Task Level Native Optimization •Support for Multiple NameNodes to maximize Fault Tolerance • Introducing a More Powerful YARN in Hadoop 3.0 •Change in Default Ports for Various Services and Addition of New Default Ports Overview of Hadoop 3.0
  • 79. December 1, 2017 79 Hadoop 2.x vs. Hadoop 3.x. Minimum required Java version: Hadoop 2.x runs on JDK 6 and above; for Hadoop 3.x, JDK 8 is the minimum runtime version required, as many dependency library files have been taken from JDK 8. Fault tolerance: in Hadoop 2.x, fault tolerance is handled through replication, leading to storage and network bandwidth overhead; in Hadoop 3.x, support for Erasure Coding in HDFS improves fault tolerance. Hadoop 2.0 Vs Hadoop 3.0
  • 80. 80 Storage scheme: Hadoop 2.x follows a 3x replication scheme for data recovery, leading to 200% storage overhead; for instance, if there are 8 data blocks then a total of 24 blocks will occupy the storage space because of the 3x replication scheme. In Hadoop 3.x the storage overhead is reduced to 50% with support for Erasure Coding; in this case, if there are 8 data blocks then a total of only 12 blocks will occupy the storage space. Change in port numbers: Hadoop 2.x - HDFS NameNode 8020, HDFS DataNode 50010, Secondary NameNode HTTP 50091; Hadoop 3.x - HDFS NameNode 9820, HDFS DataNode 9866, Secondary NameNode HTTP 9869. Hadoop 2.0 Vs Hadoop 3.0
  • 81. December 1, 2017 81 YARN Timeline Service: the YARN timeline service introduced in Hadoop 2.0 has some scalability issues; in Hadoop 3.x it has been enhanced with ATS v2, which improves scalability and reliability. Intra-DataNode balancing: the HDFS balancer in Hadoop 2.0 caused skew within a DataNode when disks were added or replaced; Intra-DataNode balancing has been introduced in Hadoop 3.0 to address these intra-DataNode skews. Number of NameNodes: Hadoop 2.0 introduced a secondary namenode as standby; Hadoop 3.0 supports 2 or more NameNodes. Hadoop 2.0 Vs Hadoop 3.0
  • 82. December 1, 2017 82 Hadoop Installation: 1) Update Ubuntu $ sudo apt-get update 2) Download and install the JDK $ sudo apt-get install default-jdk Reference link: https://www.digitalocean.com/community/tutorials/how-to-install-ja 3) Check whether Java is installed $ java -version 4) Install SSH $ sudo apt-get install openssh-server Hadoop Installation
  • 83. December 1, 2017 83 5) Configure SSH $ ssh-keygen -t rsa -P "" note: when you get the line (Enter file in which to save the key (/home/manju/.ssh/id_rsa): ), just press the ENTER key 6) Copy id_rsa.pub to authorized keys $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys 7) Disable IPv6 To disable IPv6 on your Linux machine, you need to update /etc/sysctl.conf by adding the following lines at the end of the file Hadoop Installation
  • 84. December 1, 2017 84 $ sudo gedit /etc/sysctl.conf note: typing the above command in the terminal opens the sysctl.conf file; put the below 4 lines in that file # disable ipv6 net.ipv6.conf.all.disable_ipv6 = 1 net.ipv6.conf.default.disable_ipv6 = 1 net.ipv6.conf.lo.disable_ipv6 = 1 8) Now download the Hadoop tar.gz file from the link given below http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz or use http://mirror.fibergrid.in/apache/hadoop/common/hadoop-2.6.4/hadoop-2.6.4.tar.gz 9) Now create an apache folder in the home directory, then go to the Downloads folder and copy the newly downloaded Hadoop file into the apache folder Hadoop Installation
  • 85. December 1, 2017 85 10) Extract the Hadoop tar file in the same directory note: to extract, select the Hadoop tar file, right-click on it, choose the 'Extract Here' option, and it will extract automatically into the same folder. 11) Then create the new folders inside the Hadoop directory: a yarn folder, an hdfs directory inside yarn, and namenode and datanode directories inside hdfs. The folder structure looks like this: /home/manju/apache/hadoop/yarn/hdfs/namenode /home/manju/apache/hadoop/yarn/hdfs/datanode 12) Give permissions to the newly created directories $ chmod 777 -R /home/manju/apache/hadoop/yarn Hadoop Installation
  • 86. December 1, 2017 86 13) Update the Hadoop configuration files $ sudo gedit .bashrc Add the following environment variables at the end of the .bashrc file # -- HADOOP ENVIRONMENT VARIABLES START -- # export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 (Change the path according to your pc configuration) export HADOOP_HOME=/home/manju/apache/hadoop (Change the path according to your pc configuration) export PATH=$PATH:$HADOOP_HOME/bin export PATH=$PATH:$HADOOP_HOME/sbin export HADOOP_MAPRED_HOME=$HADOOP_HOME export HADOOP_COMMON_HOME=$HADOOP_HOME export HADOOP_HDFS_HOME=$HADOOP_HOME export YARN_HOME=$HADOOP_HOME export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib" # -- HADOOP ENVIRONMENT VARIABLES END -- # Note: after configuring the above variables, refresh the .bashrc with $ source .bashrc Hadoop Installation
  • 87. December 1, 2017 87 14) Change the setting in hadoop-env.sh Go to the Hadoop installed directory, open the etc directory, then the hadoop folder, then open hadoop-env.sh and edit (or paste) the Java home path, available at /usr/lib/jvm/java-8-openjdk-amd64 # The java implementation to use. export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 15) Change the setting in the core-site.xml file Go to the Hadoop installed directory, open the etc directory, then the Hadoop folder, then open core-site.xml, edit it using the gedit tool and paste these lines into the <configuration> tag <property> <name>fs.default.name</name> <value>hdfs://localhost:9000</value> </property> Hadoop Installation
  • 88. December 1, 2017 88 16) Change the setting in the hdfs-site.xml file Go to the Hadoop installed directory, open the etc directory, then the hadoop folder, then open hdfs-site.xml, edit it using the gedit tool and paste these lines into the <configuration> tag <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>file:/home/manju/apache/hadoop/yarn/hdfs/namenode</value> </property> <property> <name>dfs.datanode.data.dir</name> <value>file:/home/manju/apache/hadoop/yarn/hdfs/datanode</value> </property> Note: here you have to change the directory structure according to your pc; these are the namenode and datanode directories we created earlier Hadoop Installation
  • 89. December 1, 2017 89 17) Change the setting in the yarn-site.xml file Go to the Hadoop installed directory, open the etc directory, then the Hadoop folder, then open yarn-site.xml, edit it using the gedit tool and paste these lines into the <configuration> tag <property> <name>yarn.nodemanager.aux-services</name> <value>mapreduce_shuffle</value> </property> <property> <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name> <value>org.apache.hadoop.mapred.ShuffleHandler</value> </property> Hadoop Installation
  • 90. December 1, 2017 90 18) Change the setting in the mapred-site.xml file note: copy the mapred-site.xml.template file, paste it in the same directory, and rename the copied file to mapred-site.xml Go to the Hadoop installed directory, open the etc directory, then the Hadoop folder, then open mapred-site.xml, edit it using the gedit tool and paste these lines into the <configuration> tag <property> <name>mapreduce.framework.name</name> <value>yarn</value> </property> 19) Format the namenode $ cd apache/hadoop/ (then press enter; it will go to the Hadoop home directory) manju@ubuntu:~/apache/hadoop$ hdfs namenode -format (use this command to format HDFS) Hadoop Installation
  • 91. December 1, 2017 91 20) After the format completes, run these 2 commands to start Hadoop $ start-dfs.sh $ start-yarn.sh note: when you run the above commands, they may ask (yes/no); just answer "yes" 21) Finally, check whether Hadoop is working or not $ jps Note: it should show a total of 6 daemons in the terminal manju@ubuntu:~/apache/hadoop$ jps 2337 NameNode 3094 NodeManager 3127 Jps 2986 ResourceManager 2443 DataNode 2845 SecondaryNameNode 22) To stop Hadoop use $ stop-dfs.sh $ stop-yarn.sh Hadoop Installation