Introduction to Cloudera's Administrator Training for Apache Hadoop

Learn who is best suited to attend the full Administrator Training, what prior knowledge you should have, and what topics the course covers. Cloudera Senior Curriculum Manager, Ian Wrigley, will discuss the skills you will attain during Admin Training and how they will help you move your Hadoop deployment from strategy to production and prepare for the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.


  • It’s perhaps more accurate to say that HDFS federation doesn’t do much to change the single-point-of-failure situation. If you have several volumes, the volume that just failed may not be the one you need for a given job; on the other hand, if you have several NameNodes, the chance that any one of them fails increases. We recommend using high-quality hardware for the master nodes, so NameNodes seldom fail. When they do, recovery is a straightforward process and there’s little chance of data loss (assuming administrators have configured things properly beforehand). There are many possible reasons for HDFS downtime (http://www.cloudera.com/blog/2011/02/hadoop-availability/), but these two are the most pertinent to our discussion. The best source of information on HA is Cloudera’s HDFS High Availability Guide (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/CDH4-High-Availability-Guide.html). There’s a good overview of Quorum Journal Manager-based HDFS HA at http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1, and more good information at http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum. HDFS HA design information is available at https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf
  • The metadata referenced here includes the fsimage and edit log. Clients only ever contact the active NameNode, and will use a “virtual” NameNode address that always resolves to the currently active NameNode (as described a few slides later). When HA is enabled, all DataNodes in the cluster are configured with the network addresses of both NameNodes. DataNodes send all block reports, block location updates, and heartbeats to both NameNodes, but only the currently active NameNode sends commands to the DataNodes (to delete blocks, for example). This section describes a “hot standby,” which is ready to take over immediately when the active NameNode fails. It’s possible to have a “cold standby” (a machine that does not have access to the current state, and may even be powered off), but failure recovery will take far longer, so this is not the preferred approach. In theory, it’s possible to run more than two NameNodes; however, this has not been tested, and in practice no one runs more than two NameNodes in production. We still recommend using “carrier-grade” hardware for the Active and Standby NameNodes. If you’re transitioning an existing cluster to HA, you can reuse the Secondary NameNode hardware for your Standby NameNode (since there’s no Secondary NameNode in HA, as illustrated here and further described on the next slide).
  • The Active NameNode sends its edits to the JournalNodes via RPC; once it has an ACK from a majority of the JournalNodes, the write is considered committed. The Standby NameNode reads from the JournalNodes to keep its state in sync with the Active NameNode. Paxos is the well-known, well-tested distributed-systems algorithm used by the QJM and JournalNodes to ensure that no edits are lost, even if a JournalNode fails while it is being written to. In CDH4, there are some issues when you attempt to add new JournalNodes to an existing quorum; the workaround is to use rsync to copy the journal storage directory from an existing JournalNode and then restart.
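The majority-commit rule described in this note can be illustrated with a tiny sketch (illustrative Python only, not actual QJM code; the function name is invented):

```python
# Illustration of the QJM majority-commit rule: an edit is durable once
# more than half of the JournalNodes have acknowledged it.

def is_committed(acks_received, journal_node_count):
    """Return True if a strict majority of JournalNodes acknowledged the edit."""
    return acks_received > journal_node_count // 2

# With the usual three JournalNodes, two ACKs commit an edit, so a single
# crashed or lagging JournalNode does not block the NameNode.
print(is_committed(2, 3))  # → True
print(is_committed(1, 3))  # → False
```

This is why a single slow or failed JournalNode does not add to NameNode latency: the NameNode only waits for the fastest majority.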
  • This slide discusses the concepts related to failover. We will discuss the configuration process for failover in detail in upcoming slides.
  • There is a deployment diagram of HDFS HA with Automatic Failover on the next slide. You can either teach from the current slide and then move to the diagram to reinforce what you covered, or spend as little time as possible on the bullets on the current slide and move directly to the diagram, whichever best suits your teaching style. A little detail about the bullet points: a ZooKeeper Failover Controller daemon runs on each NameNode machine. It monitors the NameNode and, if the NameNode fails, automatically fails HDFS over to the Standby. When the NameNode that was originally Active comes back up, it comes up as the Standby NameNode; if desired, you can force a failback manually (using hdfs haadmin -failover). The ZooKeeper Failover Controller uses a replicated ZooKeeper ensemble to hold state. Note that the ZooKeeper Failover Controller is not a ZooKeeper server; it uses ZooKeeper to maintain state. In an HDFS HA automatic failover deployment, you will need to install, configure, and start two NameNodes, JournalNodes, ZooKeeper servers, and ZooKeeper Failover Controllers.
  • Even if you are doing most of your teaching from the previous slide, point out the differences between the HDFS HA deployment diagram a few slides ago and this one. (You might ask your students to identify which components did not appear on the first HDFS HA diagram.) Without automatic failover, there are no ZooKeeper Failover Controllers and there is no ZooKeeper ensemble. Point out that this is the logical architecture, and remind students about the physical architecture noted in the bullets on the previous slide: ZooKeeper Failover Controllers must run on the same hosts as the Active and Standby NameNodes; the ZooKeeper servers can run on any nodes in the cluster; the JournalNodes can run on any nodes as well. Cloudera Solutions Architects’ best practice as of this writing is to co-locate all of these servers on the master nodes. For example, you might distribute the ZooKeeper servers and the JournalNodes across the hosts running the NameNodes and the JobTracker. These servers are critical to the success of HDFS HA, and while the system has redundancies built in to tolerate node failure, it’s best if you can place these important servers on ultra-reliable hardware. It’s also recommended to provide separate disks for each ZooKeeper server and for each JournalNode.
  • ln is the natural logarithm. To determine the recommended baseline value for dfs.namenode.handler.count, you can go to a site like http://www.rapidtables.com/calc/math/Ln_Calc.htm to get the natural logarithm of a number, or you can run a Python one-liner such as the following: python -c 'import math; print(int(math.log(200) * 20))' The preceding command will work from the terminal windows in students’ lab environments, and it will work on Macs.
  • Note that if trash is enabled in the server configuration, then the value configured on the server is used and the client configuration is ignored. If trash is disabled in the server configuration, then the client-side configuration is checked.
  • The notes for dfs.namenode.handler.count have information about how to obtain the natural logarithm for the number of nodes.

Introduction to Cloudera's Administrator Training for Apache Hadoop: Presentation Transcript

  • 1 An Introduction to Cloudera’s Administrator Training for Apache Hadoop Ian Wrigley Sr. Curriculum Manager ian@cloudera.com
  • 2© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Why Take Cloudera Training?  Administrator Course Contents  A Deeper Dive: An overview of HDFS High Availability  A Deeper Dive: Some of Hadoop’s advanced configuration options  Question time Topics
  • 3 1 Broadest Range of Courses Developer, Admin, Analyst, HBase, Data Science 2 3 Most Experienced Instructors Over 15,000 students trained since 2009 5 Widest Geographic Coverage Most classes offered: 50 cities worldwide plus online 6 Most Relevant Platform & Community CDH deployed more than all other distributions combined 7 Depth of Training Material Hands-on exercises and VMs support live instruction Leader in Certification Over 5,000 accredited Cloudera professionals 4 State of the Art Curriculum Classes updated regularly as Hadoop evolves 8 Ongoing Learning Video tutorials and e-learning complement training Why Cloudera Training?
  • 4 Data Analyst Training Implement massively distributed, columnar storage at scale Enable random, real-time read/write access to all data HBase Training Configure, install, and monitor clusters for optimal performance Implement security measures and multi-user functionality Vertically integrate basic analytics into data management Transform and manipulate data to drive high-value utilization Enterprise Training Use Cloudera Manager to speed deployment and scale the cluster Learn which tools and techniques improve cluster performance Learning Path: System Administrators Administrator Training
  • 5© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Why Take Training?  Administrator Course Contents  A Deeper Dive: An overview of HDFS High Availability  A Deeper Dive: Some of Hadoop’s advanced configuration options  Question time Topics
  • 6© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. During the Administrator course, you learn:  The core technologies of Hadoop  How to populate HDFS from external sources  How to plan your Hadoop cluster hardware and software  How to deploy a Hadoop cluster  What issues to consider when installing Pig, Hive, and Impala  What issues to consider when deploying Hadoop clients  How Cloudera Manager can simplify Hadoop administration  How to configure HDFS for high availability  What issues to consider when implementing Hadoop security Administrator Course Objectives
  • 7© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  How to schedule jobs on the cluster  How to maintain your cluster  How to monitor, troubleshoot, and optimize the cluster Administrator Course Objectives (cont’d)
  • 8© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  The course features many Hands-On Exercises, including: –Deploying Hadoop in pseudo-distributed mode –Deploying a complete, multi-node Hadoop cluster –Importing data into HDFS using Sqoop and Flume –Installing Hive and Impala –Using Hue to control user access –Configuring HDFS High Availability –Configuring the FairScheduler –Troubleshooting problems on the cluster –… and more Hands-On Exercises
  • 9© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. Course Chapters  Introduction  Planning Your Hadoop Cluster  Hadoop Installation and Initial Configuration  Installing and Configuring Hive, Impala, and Pig  Hadoop Clients  Cloudera Manager  Advanced Cluster Configuration  Hadoop Security Introduction to Apache Hadoop Planning, Installing, and Configuring a Hadoop Cluster Course Introduction  The Case for Apache Hadoop  HDFS  Getting Data Into HDFS  MapReduce  Managing and Scheduling Jobs  Cluster Maintenance  Cluster Monitoring and Troubleshooting  Conclusion  Kerberos Configuration  Configuring HDFS Federation Cluster Operations and Maintenance Course Conclusion and Appendices
  • 10© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Why Take Training?  Administrator Course Contents  A Deeper Dive: An overview of HDFS High Availability  A Deeper Dive: Some of Hadoop’s advanced configuration options  Question time Topics
  • 11© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  A single NameNode is a single point of failure  Two ways a NameNode can result in HDFS downtime –Unexpected NameNode crash (rare) –Planned maintenance of NameNode (more common)  HDFS High Availability (HA) eliminates this SPOF –Available in CDH4 (or related Apache Hadoop 0.23.x, and 2.x) HDFS High Availability Overview
  • 12© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  HDFS High Availability uses a pair of NameNodes –One Active and one Standby –Clients only contact the Active NameNode –DataNodes heartbeat in to both NameNodes –Active NameNode writes its metadata to a quorum of JournalNodes –Standby NameNode reads from the JournalNodes to remain in sync with the Active NameNode HDFS High Availability Architecture NameNode (Active)/Quorum Journal Manager DataNode DataNode DataNode DataNode NameNode (Standby)/Quorum Journal Manager JournalNode JournalNode JournalNode
  • 13© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Active NameNode writes edits to the JournalNodes –Software to do this is the Quorum Journal Manager (QJM) –Built in to the NameNode –Waits for a success acknowledgment from the majority of JournalNodes –Majority commit means a single crashed or lagging JournalNode will not impact NameNode latency –Uses the Paxos algorithm to ensure reliability even if edits are being written as a JournalNode fails  Note that there is no Secondary NameNode when implementing HDFS High Availability –The Standby NameNode periodically performs checkpointing HDFS High Availability Architecture (cont’d)
  • 14© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Only one NameNode must be active at any given time –The other is in standby mode  The standby maintains a copy of the active NameNode’s state –So it can take over when the active NameNode goes down  Two types of failover –Manual (detected and initiated by a user) –Automatic (detected and initiated by HDFS itself) Failover
  • 15© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Automatic failover is based on Apache ZooKeeper –A coordination service system also used by HBase –An open source Apache project –One of the components in CDH  A daemon called the ZooKeeper Failover Controller (ZKFC) runs on each NameNode machine  ZooKeeper needs a quorum of nodes –Typical installations use three or five nodes –Low resource usage –Can install alongside existing master daemons Automatic Failover
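The failover behavior described on this slide can be sketched as a toy simulation (hypothetical Python, not the actual ZKFC implementation; the real controller coordinates through ZooKeeper, which is not modeled here):

```python
# Toy model of automatic failover: a health check on the Active NameNode
# fails, so the Standby NameNode is promoted.

class NameNode:
    def __init__(self, name, state):
        self.name = name
        self.state = state      # "active" or "standby"
        self.healthy = True

def failover_if_needed(active, standby):
    """Promote the standby when the active NameNode is unhealthy."""
    if not active.healthy and standby.healthy:
        active.state, standby.state = "standby", "active"

nn1 = NameNode("nn1", "active")
nn2 = NameNode("nn2", "standby")
nn1.healthy = False             # simulate an Active NameNode crash
failover_if_needed(nn1, nn2)
print(nn2.state)                # → active
```

When the failed NameNode comes back up, it rejoins as the Standby, matching the manual-failback behavior described in the speaker notes.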
  • 16© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. HDFS HA With Automatic Failover – Deployment DataNodeDataNode DataNodeDataNode JournalNode JournalNode JournalNode ZooKeeper Ensemble - Instances Typically Reside on Master Nodes NameNode (Active)/Quorum Journal Manager ZooKeeper Failover Controller NameNode (Standby)/ Quorum Journal Manager ZooKeeper Failover Controller ZooKeeperZooKeeper ZooKeeper Must Reside on the Same Host JournalNodes Typically Reside on Master Nodes Must Reside on the Same Host
  • 17© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Why Take Training?  Administrator Course Contents  A Deeper Dive: An overview of HDFS High Availability  A Deeper Dive: Some of Hadoop’s more advanced configuration options  Question time Topics
  • 18© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. hdfs-site.xml dfs.namenode.handler.count The number of threads the NameNode uses to handle RPC requests from DataNodes. Default: 10. Recommended: ln(number of cluster nodes) * 20. Symptoms of this being set too low: ‘connection refused’ messages in DataNode logs as they try to transmit block reports to the NameNode. Used by the NameNode. dfs.datanode.failed.volumes.tolerated The number of volumes allowed to fail before the DataNode takes itself offline, ultimately resulting in all of its blocks being re-replicated. Default: 0, but often increased on machines with several disks. Used by DataNodes.
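The ln(number of cluster nodes) * 20 guideline above can be computed with a short snippet (the helper function is illustrative; only the formula comes from the slide):

```python
import math

def recommended_handler_count(cluster_nodes, multiplier=20):
    """Baseline for dfs.namenode.handler.count: ln(cluster nodes) * 20."""
    return int(math.log(cluster_nodes) * multiplier)

print(recommended_handler_count(200))   # → 105
```

For a 200-node cluster this yields 105, a substantial increase over the default of 10.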
  • 19© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. core-site.xml fs.trash.interval When a file is deleted, it is placed in a .Trash directory in the user’s home directory, rather than being immediately deleted. It is purged from HDFS after the number of minutes specified. Default: 0 (disabled). Recommended: 1440 (one day). Used by clients and the NameNode.
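Enabling trash with the recommended one-day retention would look like this in core-site.xml (a sketch of the property entry; the value is in minutes, as described above):

```xml
<!-- core-site.xml: deleted files go to .Trash and are purged after 1440 minutes (one day) -->
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
```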
  • 20© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. mapred-site.xml mapred.job.tracker.handler.count Number of threads used by the JobTracker to respond to heartbeats from the TaskTrackers. Default: 10. Recommendation: ln(number of cluster nodes) * 20. Used by the JobTracker. mapred.reduce.parallel.copies Number of TaskTrackers a Reducer can connect to in parallel to transfer its data. Default: 5. Recommendation: ln(number of cluster nodes) * 4 with a floor of 10. Used by TaskTrackers. tasktracker.http.threads The number of HTTP threads in the TaskTracker which the Reducers use to retrieve data. Default: 40. Recommendation: 80. Used by TaskTrackers.
  • 21© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent. mapred-site.xml (cont’d) mapred.reduce.slowstart.completed.maps The percentage of Map tasks which must be completed before the JobTracker will schedule Reducers on the cluster. Default: 0.05 (5 percent). Recommendation: 0.8 (80 percent). Used by the JobTracker.
  • 22© Copyright 2010-2013 Cloudera. All rights reserved. Not to be reproduced without prior written consent.  Why Take Training?  Administrator Course Contents  A Deeper Dive: An overview of HDFS High Availability  A Deeper Dive: Some of Hadoop’s more advanced configuration options  Question time Topics
  • 23
  • 24 • Submit questions in the Q&A panel • Watch on-demand video of this webinar and many more at http://cloudera.com • Follow Ian on Twitter @iwrigley • Follow Cloudera University @ClouderaU • Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld • Thank you for attending! Register now for Cloudera training at http://university.cloudera.com Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013* Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013* * Excludes classes sold or delivered by Cloudera partners