Learn Hadoop Administration


The Hadoop Cluster Administration course at Edureka starts with the fundamental concepts of Apache Hadoop and the Hadoop Cluster. It covers how to deploy, manage, monitor, and secure a Hadoop Cluster. You will learn to configure backup options and to diagnose and recover from node failures in a Hadoop Cluster. The course also covers HBase Administration. There are many challenging, practical, and focused hands-on exercises for learners. Software professionals new to Hadoop can quickly learn cluster administration through technical sessions and hands-on labs. By the end of this six-week Hadoop Cluster Administration training, you will be prepared to understand and solve real-world problems that you may come across while working on a Hadoop Cluster.

  • NameNode Single Point of Failure
  • In order to scale the name service horizontally, federation uses multiple independent NameNodes/namespaces. The NameNodes are federated, that is, they are independent and do not require coordination with each other. The DataNodes are used as common storage for blocks by all the NameNodes. Each DataNode registers with all the NameNodes in the cluster. DataNodes send periodic heartbeats and block reports and handle commands from the NameNodes.
  • In a typical HA cluster, two separate machines are configured as NameNodes. At any point in time, exactly one of the NameNodes is in an Active state, and the other is in a Standby state. The Active NameNode is responsible for all client operations in the cluster, while the Standby is simply acting as a slave, maintaining enough state to provide a fast failover if necessary.

    In order for the Standby node to keep its state synchronized with the Active node, the current implementation requires that the two nodes both have access to a directory on a shared storage device (e.g. an NFS mount from a NAS). This restriction will likely be relaxed in future versions.

    When any namespace modification is performed by the Active node, it durably logs a record of the modification to an edit log file stored in the shared directory. The Standby node is constantly watching this directory for edits, and as it sees the edits, it applies them to its own namespace. In the event of a failover, the Standby will ensure that it has read all of the edits from the shared storage before promoting itself to the Active state. This ensures that the namespace state is fully synchronized before a failover occurs.

    In order to provide a fast failover, it is also necessary that the Standby node have up-to-date information regarding the location of blocks in the cluster. In order to achieve this, the DataNodes are configured with the location of both NameNodes, and send block location information and heartbeats to both.

    It is vital for the correct operation of an HA cluster that only one of the NameNodes be Active at a time. Otherwise, the namespace state would quickly diverge between the two, risking data loss or other incorrect results. In order to ensure this property and prevent the so-called "split-brain scenario," the administrator must configure at least one fencing method for the shared storage. During a failover, if it cannot be verified that the previous Active node has relinquished its Active state, the fencing process is responsible for cutting off the previous Active's access to the shared edits storage. This prevents it from making any further edits to the namespace, allowing the new Active to safely proceed with failover.

    More: http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/HDFSHighAvailabilityWithNFS.html
  • Apache Hadoop NextGen MapReduce (YARN)

    MapReduce has undergone a complete overhaul in hadoop-0.23 and we now have what we call MapReduce 2.0 (MRv2) or YARN.

    The fundamental idea of MRv2 is to split up the two major functionalities of the JobTracker, resource management and job scheduling/monitoring, into separate daemons. The idea is to have a global ResourceManager (RM) and a per-application ApplicationMaster (AM). An application is either a single job in the classical sense of Map-Reduce jobs or a DAG of jobs.

    The ResourceManager and the per-node slave, the NodeManager (NM), form the data-computation framework. The ResourceManager is the ultimate authority that arbitrates resources among all the applications in the system.

    The per-application ApplicationMaster is, in effect, a framework-specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
  • These screenshots are from a single-node CDH4 cluster, so any configuration that is not filled in here can be found in conf.empty.
  • You can still use fs.default.name
  • No Hadoop 2.0 in production. Below is the configuration of a live production cluster (CDH3):
      • NameNode: RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS
      • DataNode: RAM: 16 GB; Hard disk: 6 x 2 TB; Processor: Xeon with 2 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS
      • Secondary NameNode: RAM: 32 GB; Hard disk: 1 TB; Processor: Xeon with 4 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS

    The above configuration deals with about 10-15 TB of data per customer on average. The company has 3-4 customers using this functionality, and we found that it served us well. This environment also runs some really complex queries that slice and dice the data.
  • To choose the right hardware, software, and configuration for your production Hadoop cluster, you need to do your homework on the Hadoop distribution vendor (for example, Cloudera or Apache Hadoop), the usage or workload pattern, the cluster size, and additional Hadoop ecosystem components such as HBase (in addition to the size of the data). Based on this analysis, you need to decide on the hardware for each server node.
  • Recommended reference guide: Hadoop Operations by Eric Sammer, http://www.amazon.com/Hadoop-Operations-Eric-Sammer/dp/1449327052
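The failover sequence described in the HA note above (a shared edit log, a Standby that replays edits, and fencing before promotion) can be sketched as a toy model. This is an illustrative simulation only, not Hadoop code: the class names `SharedEditLog` and `NameNode`, the `fence` and `promote` methods, and the `mkdir` edits are all hypothetical names chosen for the example.

```python
"""Toy model of NameNode HA failover over a shared edit log (not Hadoop code)."""


class SharedEditLog:
    """Stands in for the edits directory on shared storage (hypothetical)."""

    def __init__(self):
        self.edits = []          # durable, ordered namespace modifications
        self.fenced = set()      # names of nodes cut off from the storage

    def append(self, writer, edit):
        # A fenced node can no longer write, which prevents split-brain.
        if writer in self.fenced:
            raise PermissionError(f"{writer} is fenced off from shared edits")
        self.edits.append(edit)

    def fence(self, node_name):
        self.fenced.add(node_name)


class NameNode:
    def __init__(self, name, log, active=False):
        self.name = name
        self.log = log
        self.active = active
        self.namespace = set()   # paths known to this node
        self.applied = 0         # number of shared edits already replayed

    def mkdir(self, path):
        """Client operation: only the Active node may modify the namespace."""
        if not self.active:
            raise RuntimeError(f"{self.name} is in Standby state")
        self.log.append(self.name, ("mkdir", path))  # log durably first
        self.namespace.add(path)
        self.applied = len(self.log.edits)

    def catch_up(self):
        """Standby behaviour: replay any edits not yet applied."""
        for op, path in self.log.edits[self.applied:]:
            if op == "mkdir":
                self.namespace.add(path)
        self.applied = len(self.log.edits)

    def promote(self, previous_active):
        """Failover: fence the old Active, read all edits, then go Active."""
        self.log.fence(previous_active.name)
        self.catch_up()
        self.active = True


if __name__ == "__main__":
    log = SharedEditLog()
    nn1 = NameNode("nn1", log, active=True)
    nn2 = NameNode("nn2", log)

    nn1.mkdir("/data")
    nn1.mkdir("/logs")
    nn2.promote(previous_active=nn1)

    print(sorted(nn2.namespace))
    try:
        nn1.mkdir("/late")       # the old Active still believes it is Active
    except PermissionError as err:
        print("rejected:", err)
```

The model fences before it replays the log, so the old Active cannot append an edit after the Standby has finished reading. In a real cluster the fencing method is configured by the administrator, as the note says.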

    1. www.edureka.in/hadoop-admin
    2. Course Topics
        • Week 1: Understanding Big Data; A Typical Hadoop Cluster; Hadoop Cluster Administrator: Roles and Responsibilities
        • Week 2: Hadoop 2.0; Hadoop Configuration Files; Popular Hadoop Distributions
        • Week 3: Different Hadoop Server Roles; Data Processing Flow; Cluster Network Configuration
        • Week 4: Job Scheduling; Fair Scheduler; Monitoring a Hadoop Cluster
        • Week 5: Securing Your Hadoop Cluster; Kerberos and HDFS Federation; Backup and Recovery
        • Week 6: Oozie and Hive Administration; HBase Architecture; HBase Administration
    3. Topics for Today
        • Revision
        • Hadoop 2.0
        • Hadoop Configuration Files
        • Plan your Hadoop Cluster: Hardware Considerations
        • Plan your Hadoop Cluster: Software Considerations
        • Popular Hadoop Distributions
    4. Let's Revise
        • Hadoop Core Components
        • Different Cluster Modes
    5. Hadoop 1.0 (architecture diagram: a Client; HDFS with the Name Node, the Secondary Name Node, and Data Nodes holding data blocks; MapReduce with the Job Tracker and a Task Tracker alongside each Data Node)
    6. Hadoop 1.0 vs. Hadoop 2.0
        • NameNodes: 1 in Hadoop 1.x; many in Hadoop 2.x
        • High Availability: not present in Hadoop 1.x; highly available in Hadoop 2.x
        • Processing Control: JobTracker and TaskTracker in Hadoop 1.x; Resource Manager, Node Manager, and App Master in Hadoop 2.x
    7. Hadoop 2.0: HDFS Federation (http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/Federation.html)
        • Diagram: NameNodes NN-1 … NN-k … NN-n each serve their own namespace (NS1 … NSk … NSn) and block pool (Pool 1 … Pool k … Pool n); Datanodes 1 … m provide common block storage for all of them.
    8. Hadoop 2.0: HDFS NameNode High Availability
        • All namespace edits are logged to shared NFS storage by a single writer (fencing).
        • The Standby Name Node reads the shared edit logs and applies them to its own namespace.
        • Data Nodes are configured with the location of both Name Nodes, and send block location information and heartbeats to both.
        • Diagram: Active Name Node, Standby Name Node, shared edit logs, Data Nodes holding data blocks, Secondary Name Node.
    9. Hadoop 2.0: YARN or MapReduce 2.0 (MRv2) (http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html)
        • YARN = Yet Another Resource Negotiator
        • Diagram: Clients, the Resource Manager, and Node Managers hosting Containers and App Masters. Arrow legend: MapReduce Status, Job Submission, Node Status, Resource Request.
    10. Hadoop 2.0 (combined diagram)
        • Client
        • HDFS: Active NameNode and Standby NameNode with shared edit logs, Secondary Name Node, and Data Nodes. All namespace edits are logged to shared NFS storage by a single writer (fencing); the Standby reads the edit logs and applies them to its own namespace.
        • YARN: Resource Manager, with a Node Manager (Container and App Master) alongside each Data Node.
    11. Poll Questions
    12. Hadoop 2.0 Configuration Files (configuration filename: description)
        • hadoop-env.sh, yarn-env.sh: Settings for the Hadoop daemons' process environment.
        • core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to both HDFS and YARN.
        • hdfs-site.xml: Configuration settings for the HDFS daemons, the Name Node and the Data Nodes.
        • yarn-site.xml: Configuration settings for the ResourceManager and NodeManager.
        • mapred-site.xml: Configuration settings for MapReduce applications.
        • slaves: A list of machines (one per line) that each run a DataNode and a NodeManager.
    13. Hadoop 2.0 Configuration Files
    14. Deprecated Properties (deprecated property name → new property name)
        • dfs.data.dir → dfs.datanode.data.dir
        • dfs.http.address → dfs.namenode.http-address
        • fs.default.name → fs.defaultFS

        The core functionality and usage of these configuration files are the same in Hadoop 2.0 and 1.0, but many new properties have been added and many have been deprecated. For example:
        • 'fs.default.name' has been deprecated and replaced with 'fs.defaultFS' for YARN in core-site.xml.
        • 'dfs.nameservices' has been added to enable NameNode High Availability in hdfs-site.xml.
        • In the Hadoop 2.x.x (CDH4) release, you can use either the old or the new property names. The old names are deprecated, but still work.

        http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
    15. Runtime Environment (hadoop-env.sh, yarn-env.sh)
        • Offers a way to provide custom parameters for each of the servers.
        • Sourced by the Hadoop daemons' start/stop scripts.
        • Examples of environment variables that you can specify: HADOOP_DATANODE_HEAPSIZE, YARN_HEAPSIZE.
        • Set the JAVA_HOME parameter for the JVM.
    16. Configuration Files for Core Components
        • Core: core-site.xml
        • HDFS: hdfs-site.xml
        • MapReduce: mapred-site.xml
        • YARN: yarn-site.xml
    17. core-site.xml and hdfs-site.xml

        hdfs-site.xml:

            <?xml version="1.0"?>
            <!-- hdfs-site.xml -->
            <configuration>
              <property>
                <name>dfs.replication</name>
                <value>1</value>
              </property>
            </configuration>

        core-site.xml:

            <?xml version="1.0"?>
            <!-- core-site.xml -->
            <configuration>
              <property>
                <name>fs.defaultFS</name>
                <value>hdfs://test.abc.in:8020/</value>
              </property>
            </configuration>

        http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/core-default.xml
        http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
    18. mapred-site.xml

            <?xml version="1.0"?>
            <configuration>
              <property>
                <name>mapreduce.jobhistory.address</name>
                <value>test.abc.in:10020</value>
              </property>
            </configuration>

        http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
        http://hadoop.apache.org/docs/stable/mapred_tutorial.html
        Notice the difference in the URLs for the current and stable releases.
    19. yarn-site.xml

            <?xml version="1.0"?>
            <configuration>
              <property>
                <name>yarn.resourcemanager.address</name>
                <value>test.abc.in:8021</value>
              </property>
            </configuration>

        http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml
    20. Slaves
        • Contains a list of slave hosts, one per line, that are to host DataNode and NodeManager servers.
    21. Hadoop Cluster: Facebook (http://wiki.apache.org/hadoop/PoweredBy)
    22. Hadoop Cluster: A Typical Use Case (Hadoop 1.0)
        • Name Node: RAM: 64 GB; Hard disk: 1 TB; Processor: Xeon with 8 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS
        • Secondary Name Node: RAM: 32 GB; Hard disk: 1 TB; Processor: Xeon with 4 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS
        • Data Nodes: RAM: 16 GB; Hard disk: 6 x 2 TB; Processor: Xeon with 2 cores; Ethernet: 3 x 10 GB/s; OS: 32-bit CentOS
    23. Hadoop Cluster: Thinking About The Problem
        • Single Machine
            • Great for testing and developing.
            • Not a practical implementation for large amounts of data.
        • Hadoop Cluster (small cluster to large cluster)
            • Initially four or six nodes.
            • As the volume of data grows, more nodes can easily be added.
        • Ways of deciding when the cluster needs to grow
            • Increasing amount of computation power needed.
            • Increasing amount of data which needs to be stored.
            • Increasing amount of memory needed to process tasks.
    24. Plan your Hadoop Cluster: Hardware
        • Master Hardware
            • Namenode requirements
                • RAM to fit metadata
                • Modest but dedicated disk
            • Secondary Namenode
                • Almost identical to Namenode
            • Resource Manager
                • Retains job data; memory hungry
                • Memory requirements can grow independent of cluster size
        • Slave Hardware
            • Storage
            • Computation
        • Cluster Sizing
            • Usage pattern and workloads
            • IO-bound or CPU-bound
            • Consider requirements for additional components such as HBase
    25. Plan your Hadoop Cluster: Software
        • Operating System
            • Linux is the only production-quality option today.
            • A significant number run on RHEL.
        • Java
            • The JDK is the most critical software.
            • List of tested JVMs: http://wiki.apache.org/hadoop/HadoopJavaVersions
            • Java 1.6.x
        • Operating System utilities
            • ssh
            • cron
            • rsync
            • ntp
    26. Popular Hadoop Distributions: Choose a Distribution and Version of Hadoop
        • Apache Hadoop
            • Complex cluster setup
            • Manual installation and integration of Hadoop ecosystem components such as Pig, Hive, and HBase
            • No commercial support
            • Good for a first try
        • Cloudera
            • Established distribution with many referenced deployments
            • Powerful tools for deployment, management, and monitoring, such as Cloudera Manager
    27. Popular Hadoop Distributions
        • HortonWorks
            • Only distribution without any modification in Apache Hadoop
            • HCatalog for metadata
            • Stinger for Hive
        • MapR
            • Supports native Unix filesystem
            • HA features such as snapshots, mirroring, or stateful failover
        • Amazon Elastic Map Reduce (EMR)
            • Hosted solution
            • Only Pig and Hive are available as of now
    28. Assignments: Status
        • Attempt the following assignments using the documents present in the LMS:
            • Install single-node Apache Hadoop 2.0 using a virtual machine in VMPlayer or VirtualBox.
    29. Thank You. See You in Class Next Week.
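The deprecated-property table on slide 14 lends itself to a small migration helper. The sketch below uses only Python's standard library to rewrite the three deprecated names listed on that slide to their new names inside a *-site.xml document. The function name `migrate_properties` is hypothetical, the sample value is borrowed from the core-site.xml slide, and the mapping covers only the three pairs from the slide, not the full list in the Hadoop documentation.

```python
import xml.etree.ElementTree as ET

# Old name -> new name, taken from the "Deprecated Properties" slide only.
DEPRECATED = {
    "dfs.data.dir": "dfs.datanode.data.dir",
    "dfs.http.address": "dfs.namenode.http-address",
    "fs.default.name": "fs.defaultFS",
}


def migrate_properties(xml_text):
    """Return (new_xml_text, renamed), where renamed lists (old, new) pairs."""
    root = ET.fromstring(xml_text)
    renamed = []
    for name in root.iter("name"):
        old = (name.text or "").strip()
        if old in DEPRECATED:
            name.text = DEPRECATED[old]
            renamed.append((old, name.text))
    return ET.tostring(root, encoding="unicode"), renamed


if __name__ == "__main__":
    sample = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://test.abc.in:8020/</value>
  </property>
</configuration>"""
    new_xml, renamed = migrate_properties(sample)
    for old, new in renamed:
        print(f"{old} -> {new}")
    print(new_xml)
```

Because the old names still work in Hadoop 2.x (CDH4), a helper like this is a convenience for tidying configuration files rather than a required upgrade step.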