Deploying Grid Services Using Apache Hadoop
 

An old presentation on how Yahoo! (at that point in time) had Hadoop deployed and some of our future infrastructure ideas.

  • You don't log into thousands of nodes to run a job, so you install clustering software. But that isn't very interesting unless you have data to go with it. And now that you have a lot of data, you need some place to interact with that data, and you need the tools and know-how to actually do your work.
  • There are three things that a MapReduce developer needs to provide: two functions (a map and a reduce) and the set of inputs. Inputs are generally large -- think terabytes. Reduce functions will generally produce multiple output files. All of this happens in parallel: think of cat * | grep | sort | uniq, but running over tens, hundreds, or thousands of machines at once. (A minimal sketch of the two functions appears after these notes.)
  • So in Hadoop land, the combination of the map function, the reduce function, and the list of inputs constitutes 'a job'. This job is submitted by our user, who we'll call Jerry, to an entity called the job tracker. The job tracker typically runs on a dedicated machine in larger clusters, or on a machine shared with other Hadoop framework processes in smaller clusters. Anyway, the job tracker takes our job, splits it up into tasks, and hands them to processes called task trackers, which live on every 'worker' node in the cluster. The task trackers basically watch the individual tasks and report their status back to the job tracker; if a task fails, the job tracker will try to reschedule it. (A sketch of submitting such a job appears after these notes.)
  • Hadoop also includes a distributed file system. Some key points here: clients, like the process being run by our user Jerry, pull metadata from a process called the name node and the actual data from processes called data nodes. The data nodes read the data from local storage, typically spread across multiple "real" file systems on the host machine, such as ext3, zfs, or ufs. (A sketch of the client side of this appears after these notes.)
  • Before we get into what we've done to turn Hadoop into a grid service, a bit of a warning. In some cases, the configurations and solutions we're using are meant to scale to really large numbers of jobs and/or nodes and/or users. For your own installation, a lot of this just might be overkill. So don't take this as "the" way to use Hadoop; instead, view this as "a" way. And, in fact, you'll discover that amongst the existing Hadoop community, we're on the extreme end of just about everything.
  • So given the previous general outline of Hadoop, how do Yahoo!'s engineers access grid services? As I mentioned earlier, you can't log into thousands of machines. Instead, we provide a landing zone of sorts that users log into to launch their jobs and do other grid-related work. We call these machines gateways. Gateway nodes are geared towards solving two problems: reducing development time by providing fast access to the grid, and providing some security, especially against accidental data leaks. Currently, Hadoop doesn't have a very strong security model, so it takes some extra work outside of the Hadoop configuration to make it somewhat protected from intentional or even accidental exposure.
  • So if the users log into the gateways to launch jobs, where do the jobs actually run? We have similarly configured machines that we call the compute cluster. The compute cluster nodes and the name node associated with that cluster are provisioned such that they are both HDFS and MR machines. We try to keep the machines relatively homogeneous to cut down on the amount of custom configuration, so while I list the specs for the hardware that we use, note that on a given compute cluster they will all generally have the exact same configuration. The one big exception is the name node: since more memory means more and larger files, we make it a bit beefier in order to have a bigger HDFS. (A back-of-the-envelope sketch of that memory math appears after these notes.)
  • So now we know where a user does the work, and the configuration of the nodes that do that work, but how does the user submit a job to the system? Because of the number of users and the types of jobs they submit, we ran into a bit of a problem: Hadoop's built-in scheduler just isn't that advanced. There is no guarantee of job isolation or service levels or any of those other tools that admins like to utilize. We clearly had to do something in the interim to get us over this bump in the road, and the obvious and quick answer was to look at what the HPC people were doing.
  • Enter Hadoop on Demand, or HoD. It uses a set of applications called Torque and Maui as the basis of its system. By using HoD in combination with Torque and Maui we get important features like job isolation, but at a pretty big cost: we lost some significant functionality, gained quite a bit of complexity, and, even worse, lost a significant amount of efficiency because nodes couldn't be used for other work even when they were idle. To that end, the Hadoop development community is looking at what can be done to build this sort of functionality into the Hadoop scheduling system, so if you are interested in those sorts of problems, please help contribute. :) [If it gets asked: why not Condor? We had to choose something, and Torque was simpler, with a fairly centralized administration system and widespread adoption in the HPC community.]
  • To give you an idea of how the system works and some of the problems associated with it, let's take a look at three users sharing a 12 node cluster. Our first user, Sue, requests 4 nodes: 1 node goes to run her private job tracker, and the remaining three nodes are used to run her tasks. So Sue has her job up and running, and then Jerry comes along. He needs a bit more computing power, so he requests 5 nodes: 1 node for the job tracker and 4 nodes for the task trackers. At this point, we only have two nodes free, since one node is being used for the name node. Now David comes along. He's a bit of a power user, so he needs 6 nodes: one node for his job tracker and 5 nodes for his task trackers. But since there are only two nodes completely free, he has to wait. If Sue's and Jerry's jobs are only using a single reducer or a single mapper, those other nodes might actually be free, but since we dedicate every machine for the lifetime of the entire job, David may be waiting for resources that are actually free. (A toy model of this allocation appears after these notes.)
  • Our grids typically have more than 12 nodes, and user jobs are usually measured in hundreds of nodes. Anywhere from 10-20% of the nodes are down. If we re-scale the picture for 1,000 nodes, it looks more like this.
  • So, let's put the whole system together. Our users log into gateway machines. They use HoD to establish private job trackers for themselves. HoD contacts the resource manager on the user's behalf to determine which nodes will make up the private job tracker and its task trackers. Once the private job tracker is established, Jerry submits jobs directly to his virtual cluster. Additionally, he may read/write files on the data nodes, either from his job or from the gateway, which, for simplicity's sake, is drawn as a connection to the name node.
  • When scaling to this many machines, the network layout is extremely important; there can be a significant amount of network traffic when dealing with large-scale jobs. In our configuration, each rack of forty hosts has a dedicated switch, and that switch has 2 gigabit connections to each of four cores. As you can see, we are always but one hop from any given host to any other host. (A back-of-the-envelope look at the rack uplinks appears after these notes.)
  • The big message here is that our team was tasked with supporting two very different environments. Given that we're probably some of the most, if not the most, hard-core open source advocates inside Yahoo!, the decision to dump all of our internal tools was an easy one. We want to be part of the larger open source community, and this means running, using, and, more importantly, contributing back the same things that all of you do.
  • We had some requirements given to us by the Yahoo! side of the Hadoop development community as to what they wanted in place for future work. So, putting together all of our requirements to support a pure multi-user environment, this is what we came up with. We're using LDAP and DNS for the things people normally use them for, with one big exception: we put Kerberos in place for password management, so our LDAP system has no user passwords in it. Using Kerberos not only lets us support fancy things like single sign-on without using SSH keys, it also let us take advantage of the Active Directory already used for some of the corporate IT functions. Our kickstart image installs a base OS, adds a couple of tweaks, and then installs bcfg2 to do the real work. bcfg2 lets us centralize configuration management so that we're not doing mass changes over trusted ssh.
  • Now, it might surprise you that we're using NFS at all. Keep in mind that we do have actual users, and when you have actual users, you need to supply them with home directories and project directories for groups. For our research grids, having this dependency upon NFS is fine. For grids that do "real work", where SLAs are at stake, NFS should be avoided as much as possible.
  • Let's put this all together. We duplicate the top-level DNS, Kerberos, and LDAP services to another data center, both for redundancy and to provide access to these services in that data center. From our local copy, we create a set of read-only slaves. DNS is configured to be caching on each node, so there is no need to duplicate that particular service in the same way. From there, we have a cloud of servers that provide configuration and provisioning management, with almost all of the boot data coming from LDAP or from a source code control system that isn't pictured. Given that the above services are horizontally scalable, we can support pretty much as many grids as we want; we just throw more servers into the farms. And, just to top it all off, we have our NFS layer. We generally share the home directory amongst all grids, but we have dedicated project directories per grid. Depending upon the data center, the NFS layer may or may not be the same server.
  • I mentioned HoD using Torque earlier. One of the features we rely on heavily is that Torque has a health check function that will automatically enable or disable a host from being used by the scheduler. This has been invaluable in cutting down the number of service tickets we get from users: if a node is sick, we don't want people to use it. We also use Nagios, like a large chunk of the rest of the planet, and we did write some custom bits that hook into our Torque health checks as another way to report how sick a given grid is. Something completely custom we're using is called Simon. Simon started life as part of the search engine's suite of monitoring tools. Unfortunately, I can't talk much about it, but I can say that it is similar in concept to Ganglia, though we like to believe it is better. Simon is on the roadmap to be open sourced at some point in the mysterious future.
  • The breaks that drop to 0 nodes are downtimes, i.e., upgrades. The bigger downtimes are for the 0.16 upgrade, due to the number of issues we encountered.
  • We also manipulate machines in terms of groups: we have some custom tools that let us submit commands and copy files to batches of machines at once. Netgroup is likely the most underrated and underappreciated tool that system administrators have. What if you could cascade groups? And store hostnames in them? That is exactly what netgroup gives you the capability to do, and it is built into pretty much every UNIX in existence. If you have a large set of machines and aren't using netgroup, you have probably built something custom that does the exact same thing. In our case, we're using the netgroup functionality to provide the backend for our group utilities. I'm hoping at some point we'll be able to open source them as well, but we're not quite there yet. (A toy illustration of cascaded groups appears after these notes.)
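To make the "two functions plus a set of inputs" note above concrete, here is a minimal word-count sketch in Java against the classic org.apache.hadoop.mapred API of that era. None of this code is from the deck; the class names are illustrative, and word counting simply stands in for the cat | grep | sort | uniq pipeline analogy.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountFunctions {

  // The map function: runs in parallel over splits of the (terabyte-scale) inputs.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      StringTokenizer tok = new StringTokenizer(value.toString());
      while (tok.hasMoreTokens()) {
        word.set(tok.nextToken());
        out.collect(word, ONE);              // emit (word, 1)
      }
    }
  }

  // The reduce function: each reducer sums the counts for its share of the keys
  // and writes its own output file (part-00000, part-00001, ...).
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new IntWritable(sum));
    }
  }
}
```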
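And here is a hedged sketch of how the "map function + reduce function + list of inputs" bundle gets handed to the job tracker, again using the classic mapred API. The input and output paths and the WordCountFunctions class are made up for illustration; the job tracker address comes from the cluster configuration on the gateway, not from this code.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(SubmitWordCount.class);
    conf.setJobName("wordcount");

    conf.setMapperClass(WordCountFunctions.Map.class);      // the map function
    conf.setReducerClass(WordCountFunctions.Reduce.class);   // the reduce function
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    // The list of inputs: the job tracker splits these into tasks and hands
    // them to the task trackers on the worker nodes.
    FileInputFormat.setInputPaths(conf, new Path("/data/webcrawl"));      // hypothetical
    FileOutputFormat.setOutputPath(conf, new Path("/user/jerry/wc-out")); // hypothetical

    // Submit to the job tracker and wait. Failed tasks are rescheduled by the
    // job tracker itself, not by this client.
    JobClient.runJob(conf);
  }
}
```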
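For the HDFS notes, a minimal sketch of the client side: the FileSystem API hides the name node (metadata) and data node (block data) conversation from the caller. The path is hypothetical.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.default.name -> the name node
    FileSystem fs = FileSystem.get(conf);

    // Opening the file asks the name node which blocks exist and where they live;
    // reading then streams those blocks from the data nodes, which in turn read
    // them off their local ext3/zfs/ufs file systems.
    FSDataInputStream in = fs.open(new Path("/user/jerry/wc-out/part-00000")); // hypothetical
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}
```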
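On the "beefier name node" point: every file, directory, and block lives as an object in name node heap, so heap size bounds how big the HDFS namespace can get. The roughly 150 bytes per object figure below is the commonly quoted rule of thumb rather than a number from the deck, and the heap size is purely illustrative.

```java
public class NameNodeHeapEstimate {
  public static void main(String[] args) {
    long heapBytes = 16L * 1024 * 1024 * 1024; // assume a 16 GB name node heap (illustrative)
    long bytesPerObject = 150;                 // rough rule of thumb per file/dir/block object

    long maxObjects = heapBytes / bytesPerObject;
    // If an average file spans two blocks, each file costs roughly three
    // namespace objects (one file plus two blocks).
    long maxFiles = maxObjects / 3;

    System.out.printf("~%,d namespace objects, roughly %,d files%n", maxObjects, maxFiles);
  }
}
```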
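The Sue/Jerry/David walkthrough boils down to whole-node, job-lifetime allocation. This toy model (my sketch, not Yahoo! or HoD code) just replays that arithmetic: 12 nodes, one held by the name node, and each user grabs a private job tracker node plus task tracker nodes until the free pool runs dry.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class WholeNodeAllocation {
  public static void main(String[] args) {
    Deque<String> free = new ArrayDeque<>();
    for (int i = 1; i <= 12; i++) {
      free.add("node" + i);
    }
    free.poll();                  // one node is permanently held by the name node

    allocate("sue", 4, free);     // 1 private job tracker + 3 task trackers
    allocate("jerry", 5, free);   // 1 private job tracker + 4 task trackers
    allocate("david", 6, free);   // only 2 nodes left, so he waits -- even if
                                  // sue's or jerry's task trackers are idle
  }

  static void allocate(String user, int wanted, Deque<String> free) {
    if (free.size() < wanted) {
      System.out.printf("%s wants %d nodes but only %d are free: waiting%n",
                        user, wanted, free.size());
      return;
    }
    String jobTracker = free.poll();
    for (int i = 1; i < wanted; i++) {
      free.poll();                // task tracker nodes dedicated for the job's lifetime
    }
    System.out.printf("%s gets %s as a private job tracker plus %d task tracker nodes%n",
                      user, jobTracker, wanted - 1);
  }
}
```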
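For the network layout notes, a back-of-the-envelope look at the rack uplinks. The 40 hosts per rack and the 2 gigabit links to each of four cores come from the description above; the 1 Gb NIC per host is purely an assumption for illustration, so treat the ratio as a sketch of why cross-rack traffic is the scarce resource, not as a measured figure.

```java
public class RackOversubscription {
  public static void main(String[] args) {
    int hostsPerRack = 40;
    double nicGbPerHost = 1.0;       // assumed host NIC speed; not stated in the slides
    double uplinkGb = 2.0 * 4;       // two gigabit links to each of four cores

    double intoSwitch = hostsPerRack * nicGbPerHost;
    System.out.printf("hosts can offer %.0f Gb/s to the rack switch, uplinks carry %.0f Gb/s "
        + "(about %.0f:1 oversubscription for cross-rack traffic)%n",
        intoSwitch, uplinkGb, intoSwitch / uplinkGb);
  }
}
```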
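Finally, a toy illustration (not our actual tools) of what cascading netgroups buy you: a group can contain hostnames or other groups, and expanding a group walks the whole tree. /etc/netgroup, or the netgroup map in NIS or LDAP, gives you this for free; the group and host names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NetgroupExpand {
  // group name -> members; a member starting with '@' refers to another group
  static final Map<String, List<String>> GROUPS = new HashMap<>();

  public static void main(String[] args) {
    GROUPS.put("grid-all", List.of("@grid-gateways", "@grid-compute"));
    GROUPS.put("grid-gateways", List.of("gw1.example.com", "gw2.example.com"));
    GROUPS.put("grid-compute", List.of("node001.example.com", "node002.example.com"));

    // Expanding the top-level group yields every host in every nested group.
    System.out.println(expand("grid-all"));
  }

  static List<String> expand(String group) {
    List<String> hosts = new ArrayList<>();
    for (String member : GROUPS.getOrDefault(group, List.of())) {
      if (member.startsWith("@")) {
        hosts.addAll(expand(member.substring(1)));   // cascade into the nested group
      } else {
        hosts.add(member);
      }
    }
    return hosts;
  }
}
```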
