Learn Hadoop Administration

www.edureka.in/hadoop-admin
Course Topics
 Week 1
– Understanding Big Data
– A typical Hadoop Cluster
– Hadoop Cluster Administrator: Roles and Responsibilities
 Week 2
– Hadoop 2.0
– Hadoop Configuration files
– Popular Hadoop Distributions
 Week 3
– Different Hadoop Server Roles
– Data processing flow
– Cluster Network Configuration
 Week 4
– Job Scheduling
– Fair Scheduler
– Monitoring a Hadoop Cluster
 Week 5
– Securing your Hadoop Cluster
– Kerberos and HDFS Federation
– Backup and Recovery
 Week 6
– Oozie and Hive Administration
– HBase Architecture
– HBase Administration

Topics for Today
 Revision
 Hadoop 2.0
 Hadoop Configuration Files
 Plan your Hadoop Cluster: Hardware Considerations
 Plan your Hadoop Cluster: Software Considerations
 Popular Hadoop Distributions

Let's Revise
 Hadoop Core Components
 Different Cluster Modes

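Hadoop 1.0
[Diagram: a Client works with two layers, HDFS and Map Reduce. HDFS consists of the Name Node, the Secondary Name Node and Data Nodes that hold the data blocks. Map Reduce consists of the Job Tracker and Task Trackers. The Name Node is shown together with the Job Tracker, and each Data Node together with a Task Tracker.]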

Hadoop 1.0 Vs. Hadoop 2.0
Property             Hadoop 1.x                 Hadoop 2.x
NameNodes            1                          Many
High Availability    Not present                Highly Available
Processing Control   JobTracker, Task Tracker   Resource Manager, Node Manager, App Master

Hadoop 2.0 HDFS Federation
http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html
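[Diagram, left: a single Namenode provides the Namespace (NS) and Block Management; Storage is provided by the Datanodes.]
[Diagram, right: with federation, Namenodes NN-1 ... NN-k ... NN-n each serve their own namespace NS1 ... NSk ... NSn. Each namespace has its own block pool (Pool 1 ... Pool k ... Pool n), and all block pools share common storage on Datanode 1, Datanode 2 ... Datanode m.]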
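The federation layout above comes down to a few hdfs-site.xml properties: one list of nameservices plus one RPC address per NameNode. The following is a minimal sketch, not a complete federation setup. The nameservice IDs (ns1, ns2), the host names and the two helper functions are hypothetical; `dfs.nameservices` and `dfs.namenode.rpc-address.<nameservice>` are the Hadoop 2.x property names.

```python
import xml.etree.ElementTree as ET


def federation_properties(namenodes):
    """Build hdfs-site.xml properties for a federated cluster.

    `namenodes` maps a nameservice ID to that NameNode's RPC address.
    """
    props = {"dfs.nameservices": ",".join(namenodes)}
    for ns_id, rpc_address in namenodes.items():
        props["dfs.namenode.rpc-address." + ns_id] = rpc_address
    return props


def to_configuration_xml(props):
    """Render a property dict in the <configuration> layout Hadoop reads."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical hosts and nameservice IDs, for illustration only.
props = federation_properties({
    "ns1": "nn1.example.com:8020",
    "ns2": "nn2.example.com:8020",
})
print(to_configuration_xml(props))
```

Every Datanode would read the same file, which is how it learns about all the Namenodes whose block pools it stores.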

Hadoop 2.0 HDFS NameNode High Availability
 Active Name Node: all namespace edits are logged to shared NFS storage (the shared edit logs); single writer (fencing).
 Standby Name Node: reads the edit logs and applies them to its own namespace.
 Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
[Diagram labels: Active Name Node, Standby Name Node, Secondary Name Node, shared edit logs, four Data Nodes holding data blocks.]
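The single-writer rule is the core of this design, so a toy model may help. The sketch below is not Hadoop code: the class and method names are invented, and the "claim" step only stands in for fencing, which a real cluster performs through configured fencing methods. It shows the active node logging every edit before applying it, the standby replaying the shared log, and a fenced-off old active being rejected after failover.

```python
class SharedEditLog:
    """Toy stand-in for the shared edit log directory (not Hadoop code)."""

    def __init__(self):
        self.edits = []
        self.writer = None  # the single NameNode currently allowed to write

    def fence_and_claim(self, namenode):
        # Fencing: whoever claims the log revokes the previous writer.
        self.writer = namenode

    def append(self, namenode, edit):
        if namenode is not self.writer:
            raise PermissionError(namenode.name + " is fenced off")
        self.edits.append(edit)


class NameNode:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.namespace = set()
        self.applied = 0  # number of shared edits already applied locally

    def mkdir(self, path):
        """Active path: log the edit first, then apply it locally."""
        self.log.append(self, ("mkdir", path))
        self.namespace.add(path)
        self.applied = len(self.log.edits)

    def catch_up(self):
        """Standby path: replay edits this node has not applied yet."""
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)


log = SharedEditLog()
active, standby = NameNode("nn1", log), NameNode("nn2", log)
log.fence_and_claim(active)
active.mkdir("/data")

# Failover: the standby catches up, fences the old active, and takes over.
standby.catch_up()
log.fence_and_claim(standby)
standby.mkdir("/logs")
try:
    active.mkdir("/stale")
except PermissionError as err:
    print(err)  # nn1 is fenced off
```

Because the old active fails at the log append, the stale edit never reaches its namespace either, which is the point of logging before applying.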

Hadoop 2.0 : YARN or MapReduce 2.0 (MRv2)
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html
YARN = Yet Another Resource Negotiator
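[Diagram: two Clients submit jobs to the Resource Manager (Job Submission). Each Node Manager hosts Containers, and some of those run an App Master. Node Managers report Node Status to the Resource Manager, App Masters send Resource Requests to the Resource Manager, and App Masters report MapReduce Status back to the Clients.]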

Hadoop 2.0
[Diagram: a Client works with two layers, HDFS and YARN. HDFS: the Active NameNode logs all namespace edits to shared NFS storage (shared edit logs; single writer, fencing); the Standby NameNode reads the edit logs and applies them to its own namespace; a Secondary Name Node and four Data Nodes are also shown. YARN: one Resource Manager and four Node Managers, each hosting a Container and an App Master.]

Hadoop 2.0 Configuration Files
Configuration Filename       Description
hadoop-env.sh, yarn-env.sh   Settings for the Hadoop daemons' process environment.
core-site.xml                Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
hdfs-site.xml                Configuration settings for the HDFS daemons: the Name Node and the Data Nodes.
yarn-site.xml                Configuration settings for the ResourceManager and NodeManager.
mapred-site.xml              Configuration settings for MapReduce applications.
slaves                       A list of machines (one per line) that each run a DataNode and a NodeManager.

Hadoop 2.0 Configuration Files: Deprecated Properties
The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated.
For example:
 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' in core-site.xml
 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml
Deprecated Property Name   New Property Name
dfs.data.dir               dfs.datanode.data.dir
dfs.http.address           dfs.namenode.http-address
fs.default.name            fs.defaultFS
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
 In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new properties.
– The old property names are now deprecated, but still work!
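Since the old names still work, a cluster can carry them around unnoticed. As a rough sketch of how an administrator might spot and rename them, the script below rewrites deprecated names in a configuration file's text. The mapping holds only the three pairs listed above; the function name `modernize` and the sample input are made up for illustration, and the full list is on the DeprecatedProperties page.

```python
import xml.etree.ElementTree as ET

# Old name -> new name: only the three pairs listed above.
DEPRECATED = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def modernize(config_xml):
    """Return (updated XML text, list of (old, new) renames performed)."""
    root = ET.fromstring(config_xml)
    renamed = []
    for name in root.iter("name"):
        old_name = (name.text or "").strip()
        new_name = DEPRECATED.get(old_name)
        if new_name:
            renamed.append((old_name, new_name))
            name.text = new_name
    return ET.tostring(root, encoding="unicode"), renamed


# Hypothetical Hadoop 1.x style core-site.xml content.
old_core_site = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>"""

new_xml, renamed = modernize(old_core_site)
print(renamed)  # [('fs.default.name', 'fs.defaultFS')]
```

Values are left untouched; only the property names change.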

Runtime Environment (hadoop-env.sh, yarn-env.sh)
 Offers a way to provide custom parameters for each of the servers.
 Sourced by the Hadoop daemons' start/stop scripts.
 Examples of environment variables that you can specify:
– HADOOP_DATANODE_HEAPSIZE
– YARN_HEAPSIZE
 Set the JAVA_HOME parameter here so that the daemons can find the JVM.

Configuration Files for Core Components
Core         core-site.xml
HDFS         hdfs-site.xml
Map Reduce   mapred-site.xml
YARN         yarn-site.xml

core-site.xml and hdfs-site.xml
hdfs-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
core-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
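All of these files share the same `<configuration>` / `<property>` / `<name>` / `<value>` layout, so they are easy to inspect with a script. The following is a minimal sketch of reading the two example files into a dictionary, for instance during a quick configuration audit. The helper name `read_properties` is made up, and the XML is inlined here rather than read from disk.

```python
import xml.etree.ElementTree as ET


def read_properties(config_xml):
    """Parse Hadoop-style <configuration> XML into a {name: value} dict."""
    root = ET.fromstring(config_xml)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


hdfs_site = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>"""

core_site = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>"""

# Combine both files into one view of the site-specific settings.
settings = {**read_properties(hdfs_site), **read_properties(core_site)}
print(settings["fs.defaultFS"], settings["dfs.replication"])
```

Any property missing from the site files falls back to the defaults documented in core-default.xml and hdfs-default.xml.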

mapred-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>test.abc.in:10020</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
http://hadoop.apache.org/docs/stable/mapred_tutorial.html
Notice the difference in the URLs for the current and stable releases.

yarn-site.xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>test.abc.in:8021</value>
  </property>
</configuration>
http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml

Slaves
 Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.

Hadoop Cluster: Facebook
http://wiki.apache.org/hadoop/PoweredBy

Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
 Name Node: RAM 64 GB; hard disk 1 TB; processor Xeon with 8 cores; OS 32bit CentOS
 Secondary Name Node: RAM 32 GB; hard disk 1 TB; processor Xeon with 4 cores; OS 32bit CentOS
 Data Node: RAM 16 GB; hard disk 6 X 2 TB; processor Xeon with 2 cores; Ethernet 3 X 10 GB/s; OS 32bit CentOS
 Data Node: RAM 16 GB; hard disk 6 X 2 TB; processor Xeon with 2 cores; OS 32bit CentOS

Hadoop Cluster: Thinking About The Problem
Single Machine
 Great for testing, developing.
 Not a practical implementation for large amounts of data.
Hadoop Cluster (Small Cluster to Large Cluster)
 Initially four or six nodes.
 As the volume of data grows, more nodes can easily be added.
 Ways of deciding when the cluster needs to grow: an increasing amount of
– computation power needed.
– data which needs to be stored.
– memory needed to process tasks.

Plan your Hadoop Cluster: Hardware
 Master Hardware
 Namenode requirements: RAM to fit metadata; modest but dedicated disk
 Secondary Namenode: almost identical to Namenode
 Resource Manager: retains job data, memory hungry; memory requirements can grow independent of cluster size
 Slave Hardware
 Storage
 Computation
 Cluster Sizing
 Usage pattern and workloads
 IO-bound or CPU-bound
 Consider requirements for additional components such as HBase
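For the storage side of cluster sizing, a back-of-the-envelope calculation is often enough to start with. The sketch below rests on two stated assumptions: a replication factor of 3 (the HDFS default for dfs.replication) and 25% of each node's disk reserved for intermediate and non-HDFS data (an illustrative figure, not a recommendation from this course). The disk layout of 6 x 2 TB is the Data Node from the use-case slide; the function names are made up.

```python
import math


def usable_tb_per_node(disks, tb_per_disk, replication=3, reserved_fraction=0.25):
    """User data one DataNode can hold after replication and reserved space."""
    raw_tb = disks * tb_per_disk
    return raw_tb * (1 - reserved_fraction) / replication


def nodes_needed(data_tb, **node_spec):
    """Smallest number of DataNodes that can store `data_tb` of user data."""
    return math.ceil(data_tb / usable_tb_per_node(**node_spec))


per_node = usable_tb_per_node(disks=6, tb_per_disk=2)
print(per_node)                                   # 3.0
print(nodes_needed(100, disks=6, tb_per_disk=2))  # 34
```

This covers storage only; CPU-bound workloads or extra components such as HBase can push the node count higher than the storage estimate suggests.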

Plan your Hadoop Cluster: Software
 Operating System
 Linux is the only production quality option today.
 A significant number of clusters run on RHEL.
 Java
 JDK: the most critical software
 List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
 Java 1.6.x
 Operating System utilities
 ssh
 cron
 rsync
 ntp

Popular Hadoop Distributions
 Choose a Distribution and Version of Hadoop
 Apache Hadoop
 Complex cluster setup
 Manual install and integration of Hadoop ecosystem components such as Pig, Hive, HBase etc.
 No commercial support
 Good for a first try
 Cloudera
 Established distribution with many referenced deployments
 Powerful tools for deployment, management and monitoring, such as Cloudera Manager

Popular Hadoop Distributions (continued)
 HortonWorks
 Only distribution without any modification to Apache Hadoop
 HCatalog for metadata
 Stinger for Hive
 MapR
 Supports a native Unix filesystem
 HA features such as snapshots, mirroring or stateful failover
 Amazon Elastic Map Reduce (EMR)
 Hosted solution
 Only Pig and Hive are available as of now

Assignments – Status
 Attempt the following Assignments using the documents present in the LMS:
 Install single-node Apache Hadoop 2.0 using a Virtual Machine in VMPlayer or VirtualBox.

Thank You
See You in Class Next Week
