Bridging Technology Gap Netxillon Technologies
Hadoop Administration
By Gurmukh Singh
Module 7: Data Ingestion and Storage
Hadoop Netxillon Technologies
Key Points from Module 5:
• Advantages of YARN.
• Hadoop HA
• Shared Storage
• QJM (Quorum Journal Manager)
Agenda:
• Ingesting Data into Hadoop.
• Hadoop as a Data Warehouse.
• Pig Scripting.
• Flume to ingest data: Twitter and Apache Web Server logs.
• Hive as a Query Layer.
• Hive Metastore in MySQL.
• HBase as a NoSQL Database.
• HA setup for HBase
Hadoop 2.0 Setup differences
- The configuration files have moved to “$HADOOP_HOME/etc/hadoop”.
- The example JARs are now located at “$HADOOP_HOME/share/hadoop/mapreduce/*example.jar”.
- The admin binaries are now located in “$HADOOP_HOME/sbin”.
- The JobTracker and TaskTracker are replaced by the ResourceManager and NodeManager.
- There is no “hadoop-daemon.sh start resourcemanager” command; the YARN daemons are started with the YARN scripts, for example “yarn-daemon.sh start resourcemanager”.
- Job execution is handled by YARN.
Hadoop 2.0 Cluster Setup
hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/data/namenode</value>
</property>
core-site.xml
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://ha-nn1.hacluster1.com:9000</value>
</property>
</configuration>
Hadoop 2.0 Distributed Setup
yarn-site.xml
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
mapred-site.xml
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
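A quick sanity check before starting the daemons is to parse a site file and confirm the value you expect is really there. The following is a minimal sketch, assuming the file content is available as a string; the helper names `load_properties` and `runs_on_yarn` are illustrative and not part of Hadoop.

```python
import xml.etree.ElementTree as ET

# Sample content mirroring the mapred-site.xml property shown on the slide.
MAPRED_SITE = """
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Parse a Hadoop *-site.xml document into a {name: value} dict."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def runs_on_yarn(props):
    """True when MapReduce jobs are configured to be submitted to YARN."""
    return props.get("mapreduce.framework.name") == "yarn"


if __name__ == "__main__":
    props = load_properties(MAPRED_SITE)
    print(props)
    print("MapReduce runs on YARN:", runs_on_yarn(props))
```

The same `load_properties` helper can be pointed at the text of core-site.xml, hdfs-site.xml, or yarn-site.xml, since all four files share the same property layout.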
Hadoop 2.0 Distributed Setup
DEMO
JobTracker Disadvantages:
• It is a single point of failure.
• It is heavily loaded, because one daemon handles all of the following:
• Resource management
• Job scheduling
• Tracking job failures and re-running failed tasks
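YARN – Yet Another Resource Negotiator
First, YARN and MRv2 are not the same thing: YARN manages cluster resources, while MRv2 is the MapReduce framework that runs on top of it.
Each job controls its own destiny through its own Application Master.
YARN is responsible for cluster resource management.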
YARN components
YARN Flow
• The client submits a job and obtains an application ID from the ResourceManager.
• The RM chooses a NodeManager with available resources and asks it to launch the MR Application Master (MRAppMaster).
• The NodeManager allocates a container for the MRAppMaster, which then takes charge of the MR job.
• The MRAppMaster reads the input splits from HDFS.
• The MRAppMaster negotiates again with the ResourceManager to find the nodes with the most available resources.
• The MRAppMaster assigns the map and reduce tasks to those NodeManagers.
• Each NodeManager launches a YarnChild process to execute the tasks.
• YarnChild runs the map or reduce task after retrieving the resources it needs from HDFS.
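The "node with maximum resources" step above can be illustrated with a toy placement loop. This is only a sketch of the idea: real YARN schedulers also weigh queues, data locality, and CPU, and the node names, memory figures, and function names below are made up for the example.

```python
def pick_node(free_mb_by_node, request_mb):
    """Return the node with the most free memory that fits the request, or None."""
    candidates = {
        node: free for node, free in free_mb_by_node.items() if free >= request_mb
    }
    if not candidates:
        return None
    # Sorting the names first makes ties resolve alphabetically.
    return max(sorted(candidates), key=candidates.get)


def place_tasks(free_mb_by_node, tasks, container_mb):
    """Assign each task a container on the currently least-loaded node."""
    free = dict(free_mb_by_node)  # work on a copy
    placement = {}
    for task in tasks:
        node = pick_node(free, container_mb)
        placement[task] = node
        if node is not None:
            free[node] -= container_mb  # the container now occupies this memory
    return placement


if __name__ == "__main__":
    # Hypothetical NodeManagers and their free memory in MB.
    nodes = {"nm1": 4096, "nm2": 8192, "nm3": 2048}
    tasks = ["map-0", "map-1", "map-2", "reduce-0"]
    for task, node in place_tasks(nodes, tasks, container_mb=2048).items():
        print(task, "->", node)
```

A task maps to `None` when no node has enough free memory, which corresponds to a request that has to wait until containers are released.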
Hadoop HA - HDFS
The NameNode is a single point of failure. What if it fails?
We will have an outage, and sometimes data loss due to corruption.
How quickly can we switch over if needed?
Is the switch a manual failover or an automatic failover?
Let's look at all of the above questions.
Hadoop HA – using shared NFS
Hadoop HA – using shared NFS
DEMO
Hadoop HA - HDFS
Using Zookeeper
docs.hortonworks.com
ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
Zookeeper Configuration
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
dataDir=/tmp/zookeeper
# the port at which the clients will connect
clientPort=2181
# ensemble members: server.<id>=<host>:<peer port>:<leader election port>
server.1=192.168.1.70:2888:3888
server.2=192.168.1.71:2888:3888
server.3=192.168.1.69:2888:3888
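The limits in this file are expressed in ticks, so the effective timeouts depend on `tickTime`. The sketch below parses the settings from the slide and derives the timeouts in milliseconds along with the quorum size; the function names are illustrative, and only plain `key=value` lines are handled.

```python
# Settings copied from the slide (comments omitted).
ZOO_CFG = """\
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
server.1=192.168.1.70:2888:3888
server.2=192.168.1.71:2888:3888
server.3=192.168.1.69:2888:3888
"""


def parse_zoo_cfg(text):
    """Parse key=value lines, skipping blanks and '#' comments."""
    cfg = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        cfg[key.strip()] = value.strip()
    return cfg


def summarize(cfg):
    """Derive timeouts (ms) and majority quorum from a parsed zoo.cfg."""
    tick = int(cfg["tickTime"])
    ensemble = len([key for key in cfg if key.startswith("server.")])
    quorum = ensemble // 2 + 1  # strict majority of the ensemble
    return {
        "init_timeout_ms": tick * int(cfg["initLimit"]),
        "sync_timeout_ms": tick * int(cfg["syncLimit"]),
        "ensemble_size": ensemble,
        "quorum": quorum,
        "tolerated_failures": ensemble - quorum,
    }


if __name__ == "__main__":
    for key, value in summarize(parse_zoo_cfg(ZOO_CFG)).items():
        print(f"{key}: {value}")
```

With three servers the ensemble keeps working as long as two are up. Each server also needs a `myid` file in its `dataDir` containing its own number from the matching `server.N` line.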
Hadoop 2.0 HA Setup using QJM
• Make sure ZooKeeper is up and coordinating.
• Start the JournalNodes.
• Format the NameNode.
• Initialize the HA state in ZooKeeper for the ZKFC (hdfs zkfc -formatZK).
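The QJM setup steps map onto standard Hadoop 2.x and ZooKeeper commands. The sketch below only prints an ordered plan and executes nothing; the "where" column is an assumption about a typical layout, not something stated on the slide.

```python
# (what, where, command) in the order the slide lists the steps.
HA_BOOTSTRAP_STEPS = [
    ("Check that ZooKeeper is serving requests",
     "each ZooKeeper host", "zkServer.sh status"),
    ("Start the JournalNodes",
     "each JournalNode host", "hadoop-daemon.sh start journalnode"),
    ("Format the NameNode",
     "the first NameNode only", "hdfs namenode -format"),
    ("Initialize HA state in ZooKeeper for the ZKFC",
     "one NameNode host", "hdfs zkfc -formatZK"),
]


def render_plan(steps):
    """Return numbered, human-readable lines for a list of setup steps."""
    lines = []
    for number, (what, where, command) in enumerate(steps, start=1):
        lines.append(f"{number}. {what} (on {where}): {command}")
    return lines


if __name__ == "__main__":
    for line in render_plan(HA_BOOTSTRAP_STEPS):
        print(line)
```

Keeping the plan as data makes the ordering explicit: the JournalNodes must be running before the NameNode is formatted, because the format step writes to the shared edits directory. Formatting a NameNode erases its existing metadata, so this sequence applies to a new cluster only.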
Hadoop 2.0 HA Setup using QJM
DEMO
Hadoop 2.0 Setup
DEMO
Hadoop Upgrade
1. Check that no earlier upgrade is still pending:
hadoop dfsadmin -upgradeProgress status
2. Stop all client applications running on the MapReduce cluster.
3. Perform a filesystem check:
hadoop fsck / -files -blocks -locations > dfs-v-old-fsck-1.log
4. Save a complete listing of the HDFS namespace to a local file:
hadoop dfs -lsr / > dfs-v-old-lsr-1.log
5. Create a list of the DataNodes participating in the cluster:
hadoop dfsadmin -report > dfs-v-old-report-1.log
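6. Optionally back up the HDFS data.
7. Install the new version: point to the new Hadoop directory and update the environment variables.
8. Start the NameNode with the upgrade option:
hadoop-daemon.sh start namenode -upgrade
9. Monitor the upgrade progress:
hadoop dfsadmin -upgradeProgress status
10. Start the DataNodes after pointing them to the new Hadoop directory.
11. Check whether the NameNode has left safe mode:
hadoop dfsadmin -safemode get
12. Once the cluster has been verified, finalize the upgrade:
hadoop dfsadmin -finalizeUpgrade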
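The namespace listing saved before the upgrade is most useful when compared with a second listing taken before finalizing. The following is a minimal sketch, assuming the usual eight-column `-lsr` layout with the path in the last column; the sample lines and function names are hypothetical.

```python
# Hypothetical listings in the assumed `hadoop dfs -lsr` column layout:
# permissions, replication, owner, group, size, date, time, path.
SAMPLE_BEFORE = """\
drwxr-xr-x   - hadoop supergroup          0 2015-01-01 10:00 /data
-rw-r--r--   3 hadoop supergroup       1024 2015-01-01 10:00 /data/a.txt
"""
SAMPLE_AFTER = """\
drwxr-xr-x   - hadoop supergroup          0 2015-01-01 10:00 /data
"""


def paths_from_lsr(listing_text):
    """Extract the path column from a recursive listing."""
    paths = set()
    for line in listing_text.splitlines():
        # Split into at most 8 fields so paths containing spaces stay intact.
        fields = line.strip().split(None, 7)
        if len(fields) == 8:
            paths.add(fields[7])
    return paths


def compare_listings(before_text, after_text):
    """Report paths that disappeared or appeared between two listings."""
    before = paths_from_lsr(before_text)
    after = paths_from_lsr(after_text)
    return {"missing": sorted(before - after), "new": sorted(after - before)}


if __name__ == "__main__":
    print(compare_listings(SAMPLE_BEFORE, SAMPLE_AFTER))
```

In practice the two inputs would be the contents of dfs-v-old-lsr-1.log and a listing captured the same way after the upgrade. Any entry under "missing" is a reason to investigate before running -finalizeUpgrade, since finalizing removes the ability to roll back.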
Further Reading:
• http://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html
• https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithNFS.html
• GitHub: https://github.com/netxillon/hadoop/tree/master/HA_QJM
Further Reading:
• https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
• http://www.aosabook.org/en/hdfs.html
Topics for Next Class:
• Hive, HBase, Pig
• Sqoop, Flume
Pre-Readings before the next class:
• https://hbase.apache.org/
• http://hortonworks.com/hadoop/hive/
• https://hive.apache.org/
• https://pig.apache.org/
Any questions?
GitHub: https://github.com/netxillon/hadoop
Thanks!
trainings@netxillon.com
