This document provides an overview of Cloudera's Administrator Training course for Apache Hadoop. The training covers planning and deploying Hadoop clusters; installing and configuring components such as HDFS, Hive, and Impala; administering clusters with Cloudera Manager; configuring advanced cluster options and HDFS high availability; and Hadoop security. The hands-on course includes exercises for deploying Hadoop clusters, importing data, and troubleshooting issues.
Why Cloudera Training?
1. Broadest Range of Courses: Developer, Admin, Analyst, HBase, Data Science
2. Most Experienced Instructors: Over 15,000 students trained since 2009
3. Leader in Certification: Over 5,000 accredited Cloudera professionals
4. State of the Art Curriculum: Classes updated regularly as Hadoop evolves
5. Widest Geographic Coverage: Most classes offered; 50 cities worldwide plus online
6. Most Relevant Platform & Community: CDH deployed more than all other distributions combined
7. Depth of Training Material: Hands-on exercises and VMs support live instruction
8. Ongoing Learning: Video tutorials and e-learning complement training
Learning Path: System Administrators
Administrator Training: Configure, install, and monitor clusters for optimal performance; implement security measures and multi-user functionality
HBase Training: Implement massively distributed, columnar storage at scale; enable random, real-time read/write access to all data
Data Analyst Training: Vertically integrate basic analytics into data management; transform and manipulate data to drive high-value utilization
Enterprise Training: Use Cloudera Manager to speed deployment and scale the cluster; learn which tools and techniques improve cluster performance
• Submit questions in the Q&A panel
• Watch on-demand video of this webinar and many more at http://cloudera.com
• Follow Ian on Twitter @iwrigley
• Follow Cloudera University @ClouderaU
• Learn more at Strata + Hadoop World: http://tinyurl.com/hadoopworld
• Thank you for attending!
Register now for Cloudera training at http://university.cloudera.com
Use discount code Admin_10 to save 10% on new enrollments in Administrator Training classes delivered by Cloudera until December 1, 2013*
Use discount code 15off2 to save 15% on enrollments in two or more training classes delivered by Cloudera until December 1, 2013*
* Excludes classes sold or delivered by Cloudera partners
Editor's Notes
It's perhaps more accurate to say that HDFS federation doesn't do much to change the NameNode's status as a single point of failure. If you have several volumes, the one that just failed may not be the one you need for a given job; on the other hand, with several NameNodes, the chance that any one of them fails increases. We recommend using high-quality hardware for the master nodes, so NameNodes seldom fail. When they do, recovery is a straightforward process and there is little chance of data loss (assuming administrators have configured things properly beforehand). There are many possible reasons for HDFS downtime (http://www.cloudera.com/blog/2011/02/hadoop-availability/), but these two are the most pertinent to our discussion. The best source of information on HA is Cloudera's HDFS High Availability Guide (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-High-Availability-Guide/CDH4-High-Availability-Guide.html). There's a good overview of Quorum Journal Manager-based HDFS HA at http://blog.cloudera.com/blog/2012/10/quorum-based-journaling-in-cdh4-1, more useful background at http://www.slideshare.net/cloudera/hdfs-update-lipcon-federal-big-data-apache-hadoop-forum, and the HDFS HA design document at https://issues.apache.org/jira/secure/attachment/12547598/qjournal-design.pdf.
The metadata referenced here includes the fsimage and edit log. Clients only ever contact the active NameNode, using a "virtual" NameNode address that always resolves to the currently active NameNode (as described a few slides later). When HA is enabled, all DataNodes in the cluster are configured with the network addresses of both NameNodes. DataNodes send all block reports, block location updates, and heartbeats to both NameNodes, but only the currently active NameNode sends commands to the DataNodes (to delete blocks, for example). This section describes a "hot standby," which is ready to take over immediately when the active NameNode fails. A "cold standby" (a machine that does not have access to the current state, and may even be powered off) is possible, but failure recovery takes far longer, so this is not the preferred approach. In theory, it's possible to run more than two NameNodes; however, this has not been tested, and in practice no one runs more than two NameNodes in production. We still recommend using "carrier grade" hardware for the Active and Standby NameNodes. If you're transitioning an existing cluster to HA, you can reuse the Secondary NameNode hardware for your standby NameNode (there is no Secondary NameNode in HA, as illustrated here and further described on the next slide).
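The "virtual" NameNode address mentioned above is implemented as a logical nameservice that a client-side failover proxy provider resolves to whichever NameNode is currently active. As a minimal hdfs-site.xml sketch (the nameservice name "mycluster" and the host names are placeholders, not values from the course material):

    <!-- Logical nameservice; clients address HDFS as hdfs://mycluster -->
    <property>
      <name>dfs.nameservices</name>
      <value>mycluster</value>
    </property>
    <!-- The two NameNodes participating in HA -->
    <property>
      <name>dfs.ha.namenodes.mycluster</name>
      <value>nn1,nn2</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn1</name>
      <value>master1.example.com:8020</value>
    </property>
    <property>
      <name>dfs.namenode.rpc-address.mycluster.nn2</name>
      <value>master2.example.com:8020</value>
    </property>
    <!-- Client-side class that resolves the nameservice to the active NameNode -->
    <property>
      <name>dfs.client.failover.proxy.provider.mycluster</name>
      <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
    </property>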
The Active NameNode sends its edits to the JournalNodes via RPC; once it has an ACK from a majority of the JNs, it considers the write committed. The Standby NameNode reads from the JNs to keep its state in sync with the Active NameNode. Paxos is the algorithm the QJM and JNs use to ensure that no edits are lost even if a JN fails while being written to; it is a well-known, well-tested distributed systems algorithm. In CDH4, there are some issues when you attempt to add new QJM nodes to an existing quorum. The workaround is to use rsync to copy the journal storage directory from an existing JournalNode and then restart.
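To illustrate how the Active NameNode is pointed at the quorum, the shared edits directory is expressed as a qjournal:// URI listing the JournalNodes. A sketch with placeholder host names and local path (8485 is the default JournalNode RPC port):

    <!-- Quorum of JournalNodes that receive the Active NameNode's edits -->
    <property>
      <name>dfs.namenode.shared.edits.dir</name>
      <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
    </property>
    <!-- Local directory where each JournalNode stores its copy of the edits -->
    <property>
      <name>dfs.journalnode.edits.dir</name>
      <value>/data/1/dfs/jn</value>
    </property>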
This slide introduces the concepts behind failover; the configuration process for failover is covered in detail in upcoming slides.
There is a deployment diagram of HDFS HA with automatic failover on the next slide. You can either teach from the current slide and then move to the diagram to reinforce what you covered, or spend as little time as possible on the bullets here and move directly to the diagram, whichever best suits your teaching style. A little detail about the bullet points: A ZooKeeper Failover Controller daemon runs on each NameNode machine. It monitors the NameNode and, if the NameNode fails, automatically fails HDFS over to the Standby. When the NameNode that was originally the Active NameNode comes back up, it comes up as the Standby NameNode. If desired, you can force a failback manually (using hdfs haadmin -failover). The ZooKeeper Failover Controller uses a replicated ZooKeeper ensemble to hold state; note that the ZooKeeper Failover Controller is not a ZooKeeper server, it merely uses ZooKeeper to maintain state. In an HDFS HA automatic failover deployment, you will need to install, configure, and start two NameNodes, JournalNodes, ZooKeeper servers, and ZooKeeper Failover Controllers.
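As a sketch of what enabling automatic failover involves (the "mycluster" nameservice and the ZooKeeper host names are placeholders), dfs.ha.automatic-failover.enabled belongs in hdfs-site.xml and ha.zookeeper.quorum in core-site.xml:

    <!-- hdfs-site.xml: let the ZooKeeper Failover Controllers manage failover -->
    <property>
      <name>dfs.ha.automatic-failover.enabled</name>
      <value>true</value>
    </property>
    <!-- core-site.xml: the ZooKeeper ensemble in which the failover controllers keep state -->
    <property>
      <name>ha.zookeeper.quorum</name>
      <value>zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181</value>
    </property>

Before starting the failover controllers for the first time, initialize their znode with hdfs zkfc -formatZK. The manual failback mentioned above takes the two NameNode IDs from dfs.ha.namenodes.mycluster, for example: hdfs haadmin -failover nn1 nn2.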
Even if you are doing most of your teaching from the previous slide, point out the differences between the HDFS HA deployment diagram a few slides ago and this one. (You might ask your students to identify which components did not appear on the first HDFS HA diagram.) Without automatic failover, there are no ZooKeeper Failover Controllers and there is no ZooKeeper ensemble. Point out that this is the logical architecture, and remind students about the physical architecture noted in the bullets on the previous slide: the ZooKeeper Failover Controllers must run on the same hosts as the Active and Standby NameNodes; the ZooKeeper servers can run on any nodes in the cluster; the JournalNodes can run on any nodes as well. Cloudera Solutions Architecture's best practice as of this writing is to co-locate all of these servers on the master nodes. For example, you might distribute the ZooKeeper servers and the JournalNodes across the hosts running the NameNodes and the JobTracker. These servers are critical to the success of HDFS HA, and while the system has built-in redundancy to tolerate node failure, it's best to place these important servers on ultra-reliable hardware. It's also recommended to provide separate disks for each ZooKeeper server, and separate disks for each JournalNode.
ln is the natural logarithm. To determine the recommended baseline value for dfs.namenode.handler.count, you can go to a site like http://www.rapidtables.com/calc/math/Ln_Calc.htm to get the natural logarithm of a number, or you can run a Python one-liner such as the following: python -c 'import math; print int(math.log(200) * 20)'. The preceding command will work from the terminal windows in students' lab environments, and it will work on Macs.
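The same calculation written out as a short Python function, in case students want something more readable than the one-liner (the 200-node cluster size is just an example):

    import math

    def recommended_handler_count(num_worker_nodes):
        # Baseline for dfs.namenode.handler.count: 20 * ln(cluster size).
        return int(math.log(num_worker_nodes) * 20)

    # For a 200-node cluster, ln(200) is about 5.3, so the baseline is 105.
    print(recommended_handler_count(200))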
Note that if trash is enabled in the server configuration, the value configured on the server is used and the client configuration is ignored. If trash is disabled in the server configuration, the client-side configuration is checked.
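For reference, trash is controlled by fs.trash.interval in core-site.xml: a nonzero value (in minutes) enables it, and 0 disables it. A minimal sketch (the one-day value is just an example):

    <!-- core-site.xml: keep deleted files in .Trash for 1440 minutes (one day) -->
    <property>
      <name>fs.trash.interval</name>
      <value>1440</value>
    </property>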
The notes for dfs.namenode.handler.count have information about how to obtain the natural logarithm for the number of nodes.