View Hadoop Administration Course at www.edureka.co/hadoop-admin
Power the Hadoop Cluster with AWS Cloud
Twitter @edurekaIN, Facebook /edurekaIN, use #askEdureka for Questions
Objectives
At the end of this module, you will be familiar with:
Hadoop Cluster introduction
Recommended Configuration for cluster
Hadoop cluster running modes
Hadoop configuration files
Hadoop Admin Responsibilities
Hadoop cluster set up on AWS Demo
Hadoop Core Components
Hadoop 2.x Core Components
HDFS (Storage)
» Master: NameNode, Secondary NameNode
» Slave: DataNode
YARN (Processing)
» Master: Resource Manager
» Slave: Node Manager
Hadoop Cluster: A Typical Use Case
Active NameNode
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply
Secondary NameNode
» RAM: 32 GB
» Hard disk: 1 TB
» Processor: Xeon with 4 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply
StandBy NameNode (optional)
» RAM: 64 GB
» Hard disk: 1 TB
» Processor: Xeon with 8 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
» Power: Redundant power supply
DataNode (each)
» RAM: 16 GB
» Hard disk: 6 x 2 TB
» Processor: Xeon with 2 cores
» Ethernet: 3 x 10 GB/s
» OS: 64-bit CentOS
Cluster Growth Based On Storage Capacity
Basing cluster growth on storage capacity is often a good method to use!
» Data grows by approximately 5 TB per week
» HDFS is set up to replicate each block three times
» Thus, 15 TB of extra storage space is required per week
» Assuming machines with 5 x 3 TB hard drives, this equates to one new machine required each week
» Assume overheads to be 30%
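The arithmetic above can be sketched as a short calculation. This is a minimal sketch rather than a sizing tool: the function name and its parameters are hypothetical, and it assumes that the 30% overhead means 30% of each machine's raw disk is unavailable to HDFS (for example, intermediate data, logs and the operating system). Other readings of "overhead" would give different numbers.

```python
def machines_per_week(weekly_growth_tb=5.0, replication=3,
                      disks_per_machine=5, disk_size_tb=3.0,
                      overhead=0.30):
    """Estimate how many new slave machines are needed each week."""
    # Raw storage consumed per week once every block is replicated.
    replicated_tb = weekly_growth_tb * replication
    # Assumption: `overhead` is the fraction of raw disk HDFS cannot use.
    usable_tb = disks_per_machine * disk_size_tb * (1.0 - overhead)
    return replicated_tb / usable_tb


if __name__ == "__main__":
    print(f"Ignoring overhead: {machines_per_week(overhead=0.0):.2f} machines/week")
    print(f"With 30% overhead: {machines_per_week():.2f} machines/week")
```

Ignoring overhead, the result is one machine per week, which matches the slide. Under the assumed reading of the 30% overhead, the estimate rises to somewhat more than one machine per week.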
Slave Nodes: Recommended Configuration
Higher-performance vs lower-performance components: save the money, buy more nodes!
"A cluster with more nodes performs better than one with fewer, slightly faster nodes"
General Configuration
A general 'base' configuration for a slave node (depends on requirements):
» 4 x 1 TB or 2 TB hard drives, in a JBOD* configuration
» Do not use RAID!
» 2 x quad-core CPUs
» 24-32 GB RAM
» Gigabit Ethernet
Special Configuration
» Multiples of (1 hard drive + 2 cores + 6-8 GB RAM) generally work well for many types of applications
* JBOD: Just a Bunch Of Disks
Slave Nodes: More Details (RAM)
» Generally, each Map or Reduce task will take 1 GB to 2 GB of RAM
» Slave nodes should not be using virtual memory
» Rule of thumb: total number of tasks = 1.5 x number of processor cores
» Ensure enough RAM is present to run all tasks, plus the DataNode and NodeManager (TaskTracker in Hadoop 1.x) daemons, plus the operating system
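The rule of thumb above can be turned into a rough RAM estimate. The sketch below is illustrative only: the function name is hypothetical, and the memory reserved for the daemons and the operating system (`daemons_gb`, `os_gb`) are placeholder assumptions, not figures from the slides.

```python
import math


def slave_node_ram_gb(cores, ram_per_task_gb=2.0,
                      daemons_gb=2.0, os_gb=2.0):
    """Return (task count, estimated RAM in GB) for one slave node."""
    # Rule of thumb from the slide: total tasks = 1.5 x processor cores.
    tasks = math.floor(1.5 * cores)
    # RAM for all tasks, plus assumed reserves for the daemons and the OS.
    total_gb = tasks * ram_per_task_gb + daemons_gb + os_gb
    return tasks, total_gb


if __name__ == "__main__":
    # 2 x quad-core CPUs, as in the general slave configuration.
    tasks, ram = slave_node_ram_gb(cores=8)
    print(f"{tasks} tasks, about {ram:.0f} GB RAM")
```

With 8 cores and 2 GB per task, the estimate falls inside the 24-32 GB range recommended for a general slave node; with 1 GB per task it is correspondingly lower.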
Master Node Hardware Recommendations
A master node requires:
» Carrier-class hardware (not commodity hardware)
» Dual power supplies
» Dual Ethernet cards (bonded to provide failover)
» RAIDed hard drives
» At least 32 GB of RAM
Hadoop Cluster Modes
Hadoop can run in any of the following three modes:
Standalone (or Local) Mode
» No daemons; everything runs in a single JVM
» Suitable for running MapReduce programs during development
» Has no DFS
Pseudo-Distributed Mode
» Hadoop daemons run on the local machine
Fully-Distributed Mode
» Hadoop daemons run on a cluster of machines
Configuration Files
Configuration filenames and their descriptions:
» hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.
» core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
» hdfs-site.xml: Configuration settings for the HDFS daemons, the NameNode and the DataNodes.
» yarn-site.xml: Configuration settings for the Resource Manager and Node Manager.
» mapred-site.xml: Configuration settings for MapReduce applications.
» slaves: A list of machines (one per line) that each run a DataNode and a Node Manager.
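The *-site.xml files listed above share one layout: a `<configuration>` root holding `<property>` elements, each with a `<name>` and a `<value>`. As a minimal sketch, the snippet below reads that layout into a dictionary using only the Python standard library. The helper name `parse_site_xml` and the sample host and port are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical core-site.xml content; the host and port are placeholders.
SAMPLE_CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
"""


def parse_site_xml(xml_text):
    """Return the name/value pairs of a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            # A missing <value> element is treated as an empty string.
            props[name.strip()] = (value or "").strip()
    return props


if __name__ == "__main__":
    for key, val in parse_site_xml(SAMPLE_CORE_SITE).items():
        print(f"{key} = {val}")
```

A helper like this can be handy for auditing which properties a node actually sets, but it does not reproduce Hadoop's own behaviour of layering site files over built-in defaults.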
Configuration Files (Contd.)
The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example:
» 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' for YARN in core-site.xml
» 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml
Deprecated property name -> New property name
» dfs.data.dir -> dfs.datanode.data.dir
» dfs.http.address -> dfs.namenode.http-address
» fs.default.name -> fs.defaultFS
In the Hadoop 2.2.0 release, you can use either the old or the new properties. The old property names are now deprecated, but still work!
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
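The renames listed above could be applied mechanically when tidying up an older configuration. The following is a minimal sketch: the mapping contains only the three pairs shown on the slide (the full list is at the URL above), and the function name `modernize` is hypothetical.

```python
# Only the three renames shown on the slide; not the complete list.
DEPRECATED_TO_NEW = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(props):
    """Return (updated properties, list of (old, new) names found)."""
    updated = {}
    found = []
    for name, value in props.items():
        new_name = DEPRECATED_TO_NEW.get(name, name)
        if new_name != name:
            found.append((name, new_name))
            # If the new name is already set, its value takes precedence.
            if new_name in updated:
                continue
        updated[new_name] = value
    return updated, found


if __name__ == "__main__":
    old = {"fs.default.name": "hdfs://nn:8020", "dfs.replication": "3"}
    new, renamed = modernize(old)
    print(new)
    for before, after in renamed:
        print(f"{before} -> {after}")
```

When both the old and the new name are present, this sketch keeps the value given under the new name; that precedence is a design choice of the example, not something stated in the slides.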
Hadoop 2.x Configuration Files – Apache Hadoop
» Core: core-site.xml
» HDFS: hdfs-site.xml
» YARN: yarn-site.xml
» MapReduce: mapred-site.xml
Hadoop Daemons
NameNode daemon
» Runs on master node of the Hadoop Distributed File System (HDFS)
» Directs Data Nodes to perform their low-level I/O tasks
DataNode daemon
» Runs on each slave machine in the HDFS
» Does the low-level I/O work
Resource Manager
» Runs on the master node of the data processing system (YARN)
» Global resource scheduler
Node Manager
» Runs on each slave node of the data processing system
» Platform for the data processing tasks
JobHistoryServer
» Responsible for servicing all job-history-related requests from clients
Why Cloud?
Challenges in current trend:
Arranging a large common storage area
Providing secure access to the shared data
Amazon EC2
A cloud web host that allows you to dynamically add and remove server resources as you need them, so that you pay only for the capacity you use.
Good for Hadoop cluster setup: we can bring up an enormous cluster within minutes and then spin it down when we have finished, to reduce costs.
Hadoop on AWS
ANALYZING…
DEMO
Hadoop Admin Responsibilities
Responsible for implementation and administration of Hadoop infrastructure.
Testing HDFS, Hive, Pig and MapReduce access for Applications.
Cluster maintenance tasks like Backup, Recovery, Upgrade, Patching.
Performance tuning and Capacity planning for Clusters.
Monitor Hadoop cluster and deploy security.
How it Works?
» LIVE Online Class
» Class Recording in LMS
» 24/7 Post Class Support
» Module Wise Quiz
» Project Work
» Verifiable Certificate
Questions
Course Topics
» Module 1: Hadoop Cluster Administration
» Module 2: Hadoop Architecture and Cluster Setup
» Module 3: Hadoop Cluster: Planning and Managing
» Module 4: Backup, Recovery and Maintenance
» Module 5: Hadoop 2.0 and High Availability
» Module 6: Advanced Topics: QJM, HDFS Federation and Security
» Module 7: Oozie, HCatalog/Hive and HBase Administration
» Module 8: Project: Hadoop Implementation