11. Configuration in RPMs 27 April 2009 +Push out, rollback via kickstart. - Extra build step, may need kickstart server
12. clusterssh: cssh If all the machines start in the same initial state, they should end up in the same exit state.
13. CM-Managed Hadoop Resource Manager keeps the cluster live and talks to the infrastructure; a persistent data store holds input data and results.
14. Configuration Management tools CM tools are how to manage big clusters. Two axes: State Driven vs. Workflow, and Centralized (Radia, ITIL, lcfg, Puppet) vs. Decentralized (bcfg2, SmartFrog, Perl scripts, makefiles).
16. Model the system in the SmartFrog language

TwoNodeHDFS extends OneNodeHDFS {
  localDataDir2 extends TempDirWithCleanup { }
  datanode2 extends datanode {
    dataDirectories [LAZY localDataDir2];
    dfs.datanode.https.address "https://0.0.0.0:8020";
  }
}

Inheritance, cross-referencing, templating: extending an existing template; a temporary directory component; extend and override with new values, including a reference to the temporary directory.
20. Base Service class for all nodes

public class Service extends Configured implements Closeable {
  public void start() throws IOException;
  public void innerPing(ServiceStatus status) throws IOException;
  void close() throws IOException;
  State getLifecycleState();
  public enum State {
    UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED
  }
}
21. Subclasses implement transitions

public class NameNode extends Service
    implements ClientProtocol, NamenodeProtocol, ... {

  protected void innerStart() throws IOException {
    initialize(bindAddress, getConf());
    setServiceState(ServiceState.LIVE);
  }

  public void innerClose() throws IOException {
    if (server != null) {
      server.stop();
      server = null;
    }
    ...
  }
}
22. Health and Liveness: ping()

public class DataNode extends Service {
  ...
  public void innerPing(ServiceStatus status) throws IOException {
    if (ipcServer == null) {
      status.addThrowable(
          new LivenessException("No IPC Server running"));
    }
    if (dnRegistration == null) {
      status.addThrowable(
          new LivenessException("Not bound to a namenode"));
    }
  }
}
26. Aggregated logs

17:39:08 [JobTracker] INFO mapred.ExtJobTracker : State change: JobTracker is now LIVE
17:39:08 [JobTracker] INFO mapred.JobTracker : Restoration complete
17:39:08 [JobTracker] INFO mapred.JobTracker : Starting interTrackerServer
17:39:08 [IPC Server Responder] INFO ipc.Server : IPC Server Responder: starting
17:39:08 [IPC Server listener on 8012] INFO ipc.Server : IPC Server listener on 8012: starting
17:39:08 [JobTracker] INFO mapred.JobTracker : Starting RUNNING
17:39:08 [Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072] INFO mapred.TaskTracker : Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost:localhost/127.0.0.1:34072
17:39:08:960 GMT [INFO ][TaskTracker] HOST localhost:rootProcess:cluster - TaskTracker deployment complete: service is: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
17:39:08 [TaskTracker] INFO mapred.ExtTaskTracker : Task Tracker Service is being offered: tracker_localhost:localhost/127.0.0.1:34072 instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060
17:39:09 [IPC Server handler 5 on 8012] INFO net.NetworkTopology : Adding a new node: /default-rack/localhost
17:39:09 [TaskTracker] INFO mapred.ExtTaskTracker : State change: TaskTracker is now LIVE
35. XML in SCM-managed filesystem +Push out, rollback. - Need to restart cluster; SPOF?
37. Configuration with LDAP +View, change settings; high availability. - Not under SCM; rollback and preflight are hard.
Editor's Notes
Some people say "when your data is in the cloud, you don't need to know where it is". We say: if you have the pager that goes off when the data isn't reachable, you do, or you need to know the phone number of someone else who does.
This is what we are talking about: Hadoop on a cluster. Most of the hardware is disposable, but not your namenode. The job tracker is also something you care about. The rest can come and go: task trackers, assuming they have no side effects; datanodes, only if their data is replicated. The rate of failure will matter a lot.
Name nodes are a point of failure. Look after them. Data nodes have to be viewed as transient: they come down, they come back up. If they aren't stable enough you have problems, as you won't ever get work done. Some organisations track hard disks; they matter more than CPUs, when you think about it. Keep track of every disk's history, its batch number, the number of power cycles of every machine it has been in, and so on. Run the logs through Hadoop later to look for correlations. Trusted machines: until Hadoop has real security, you have to trust everything on the net that can talk to the machines. Either bring them all up on a VPN (performance!) or on a separate physical network.
We regularly see this discussed on the list: the problem of keeping /etc/hosts synchronised, pushing out site settings. Don't do it by hand. Do it once to learn what the files are, then stop and look for a better way.
This is how Yahoo! appear to run the largest Hadoop clusters in production. There's a good set of slides on this by Allen Wittenauer from ApacheCon 2008. LDAP would seem to add a SPOF, but you can have HA directory services. Also, if DNS is integrated with LDAP, the loss of DNS/LDAP is equally drastic, so availability is not made any worse than not using LDAP. You also need your own RPMs, as you will need to push out all binaries in RPM format. This often scares Java developers, but isn't that hard once you start on it. Hard to test though. Now, in this world, where do the hadoop XML files come from?
This is my belief as to how Hadoop 0.21 gets configured. Belief: there may be some surprises.

hadoop-core.jar has XML files for Hadoop's own configuration, plus properties files for Log4J and metrics. core-default has the default values for Hadoop's inner architecture; hdfs-default for the filesystem; mapred-default for the JobTracker and TaskTracker. This is a precursor to splitting the projects in SVN, and the JARs, and replaces hadoop-default.xml. Do not edit these files unless you know what you are doing, have a need to, and understand that your changes may be lost.

The -site files in conf/ provide overrides of these XML files, for all values not marked as final.

hadoop-policy is for group access policy; there is no -default in the JAR. You only need this if you have access control turned on.

There is a log4j.properties file. To change it, you would need to rebuild Hadoop. As you can point to a different resource in hadoop-env.sh, this should not be needed.

The hadoop-metrics file controls what metrics are collected. I'm not aware that it can be changed without a rebuild.

You may want a conf/slaves file to list allowed members of the network. This helps to secure an EC2 cluster, but stops you adding new nodes on demand.

There is the implicit configuration of the local machine: its classpath, the network, etc. These lead to support calls. Hadoop likes a stable network with DNS and reverse DNS working, and machines that know their external IP address.

When a job is submitted, the client-side configuration is bundled up and pushed out with the submission. The client-side submission must contain enough information to connect to the job tracker and the filesystem.

In the XML files there are some ${} expressions evaluating JVM properties. They get interpreted on whichever machine reads the values from the Configuration object. As the configuration moves round the network, the values may change. This may or may not be what you want.
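The layering described above (-default values overridden by -site values, except where a value is marked final) can be sketched as a toy resolver. This is purely illustrative: LayeredConf and its methods are hypothetical, not the real org.apache.hadoop.conf.Configuration, and the real class does much more (XML parsing, ${} expansion, deprecation handling).

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy illustration of Hadoop-style configuration layering:
 * later resources override earlier ones, unless an earlier
 * resource declared the name final. Hypothetical class,
 * not the real org.apache.hadoop.conf.Configuration.
 */
public class LayeredConf {
    private final Map<String, String> values = new HashMap<>();
    private final Map<String, Boolean> finals = new HashMap<>();

    /**
     * Load a resource's (name, value) pairs; values win over
     * earlier resources except where a name is already final.
     */
    public void addResource(Map<String, String> resource,
                            Map<String, Boolean> finalFlags) {
        for (Map.Entry<String, String> e : resource.entrySet()) {
            if (!Boolean.TRUE.equals(finals.get(e.getKey()))) {
                values.put(e.getKey(), e.getValue());
            }
        }
        // once final, always final
        for (Map.Entry<String, Boolean> f : finalFlags.entrySet()) {
            finals.merge(f.getKey(), f.getValue(), Boolean::logicalOr);
        }
    }

    public String get(String name) {
        return values.get(name);
    }
}
```

Loading a core-default map followed by a core-site map shows the behaviour: a site value replaces the default, unless the default layer marked that name final, in which case the site override is silently ignored.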
Cloudera are pushing out binaries and config as RPMs, with a web UI for creating new RPMs from your config, with some preflight validation taking place at that point. This lets you push out new configurations via your RPM-management tooling. I'm not over-happy with relying on remotely generated RPMs for your config, because it complicates things. It's easier than using <rpmbuild> and rolling your own, I know that, but I like to have the config files as text under SCM; this removes that option. What it does do is make it easy to bring up an 18.x release of Hadoop. Ultimately Hadoop will end up having its own RPMs, or debs, as this product only gets installed en masse. Hopefully whatever RPMs people put together will be compatible with this.
This is a screenshot of the Cloudera RPMs on CentOS 5.2. The VM is confused about its hostname, a fact which is going to break Hadoop, but what is key is that the RPM installed itself and you can run namenodes, datanodes and trackers as services. Other problem: the RPMs hard-code a Java dependency. I avoid that in my work RPMs, as you then have to choose how to install Java, which is pretty tricky in RPMs; these RPMs demand the Sun JDK, not OpenJDK or anyone else's.
This is the Cloudera strategy: tools to create RPMs (Ant and Maven have some RPM build support). Pros: + SCM-managed; + quick rollback via alternatives. Cons: - you still need to trigger a service restart and reload of settings. We'll have to watch what happens here; it really depends on the tooling around RPM creation and serving the RPMs up to the machines. A lot of the problems of pushing out updates over a big cluster go away with virtualisation, when a single shared image is used by all the workers.
If you are going to look after a set of machines with alternatives, learn to use tools like clusterssh. This will issue the same set of ssh commands to a set of machines, even a set listed in a text file. The big problem: the same command gets sent to every box. You had better have them all in sync, and have scripts that look at the state of the machine before doing things.
This is how you should view a CM-managed cluster, including a virtual one. In a "cloud" infrastructure, the services in pink are used to work with machines.
CM tools, of any kind, are the only sensible way to manage big clusters. Even then, there are issues about which design is best; here are some of the different options. What I'm going to talk about is decentralized and mostly state driven, though it has workflows too.
State Driven: observe system state, push it back into the desired state.
Workflow: apply a sequence of operations to change a machine's state.
Centralized: central DB in charge.
Decentralized: machines look after themselves.
SmartFrog is a language for describing system configurations: a non-XML language with basic mathematics and cross-references, including late-binding LAZY references to values that are not available until a system is deployed.
Here is that model turned into an instantiation across machines and multiple JVMs (for better isolation of the Hadoop processes and static variables).
Here is a demo
This is the currently proposed lifecycle for Hadoop services: namenode, datanodes, trackers, etc. Things you start and stop and wait for starting. You create something and start it; it then chooses to jitter between STARTED and LIVE unless it fails. Service extends Closeable, so Closeable.close() shuts it down. When a service fails, the underlying exception is retained for the benefit of management tools.
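The jitter between STARTED and LIVE can be expressed as a small transition table. The Lifecycle class below is a hypothetical sketch of which transitions the proposal allows; in the actual codebase the checks live inside the Service base class, not in a standalone helper like this.

```java
import java.util.EnumMap;
import java.util.EnumSet;

/**
 * Sketch of the proposed service lifecycle: which state
 * transitions are legal. Hypothetical helper class for
 * illustration only.
 */
public class Lifecycle {
    public enum State { UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED }

    private static final EnumMap<State, EnumSet<State>> LEGAL =
            new EnumMap<>(State.class);
    static {
        LEGAL.put(State.UNDEFINED, EnumSet.of(State.CREATED));
        LEGAL.put(State.CREATED,
                EnumSet.of(State.STARTED, State.FAILED, State.CLOSED));
        // a service jitters between STARTED and LIVE until it fails or closes
        LEGAL.put(State.STARTED,
                EnumSet.of(State.LIVE, State.FAILED, State.CLOSED));
        LEGAL.put(State.LIVE,
                EnumSet.of(State.STARTED, State.FAILED, State.CLOSED));
        // a failed service can only be closed; CLOSED is terminal
        LEGAL.put(State.FAILED, EnumSet.of(State.CLOSED));
        LEGAL.put(State.CLOSED, EnumSet.noneOf(State.class));
    }

    public static boolean isLegal(State from, State to) {
        return LEGAL.get(from).contains(to);
    }
}
```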
This is part of the base class. There's a lot more to it just to make it easy to subclass and move from state to state
It's not enough to bring up a service; you want to see if it is alive. This is the current draft of a ping operation, one that adds health to the services. Services can add any number of exceptions to the status, so if something fails you can see what's going on. This isn't ideal (exception construction costs are high due to stack traces); I think we need to work on this via something like a ThrowableWriteable which can be constructed from an exception, or just a stack trace sent over the wire.
This is where ping() gets complex. How do you define healthy in a cluster where the overall system availability is not "all nodes are up" but "we can get our work done", and where the failure of anything should be transient? I propose: push it up to the caller; report problems, but don't declare failure in-node just because something remote is missing. That allows policy tools to handle it on a site-specific basis. For my clusters, if the namenode isn't up in 5 minutes there are problems: kill that VM and start again. For Yahoo!, the timeout is 45 minutes and the action involves pagers.
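That caller-side policy can be stated in a few lines. LivenessPolicy below is a hypothetical sketch, not part of Hadoop: the node keeps reporting problems via ping(), and the caller decides when the problems have lasted long enough to act on.

```java
/**
 * Caller-side liveness policy sketch: a node reporting problems
 * is not treated as failed until it has been unhealthy for longer
 * than a site-specific timeout. Hypothetical helper class.
 */
public class LivenessPolicy {
    public enum Action { WAIT, ESCALATE }

    private final long timeoutMillis;

    public LivenessPolicy(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * @param unhealthySinceMillis when the node first reported problems
     * @param nowMillis            the current time
     * @return WAIT while inside the grace period, ESCALATE after it
     */
    public Action decide(long unhealthySinceMillis, long nowMillis) {
        return (nowMillis - unhealthySinceMillis) > timeoutMillis
                ? Action.ESCALATE : Action.WAIT;
    }
}
```

A VM-hosted cluster might implement ESCALATE by killing and recreating the VM after a 5 minute timeout; a physical cluster might page an operator after 45. The point is that the number and the action are site policy, not node code.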
This is how we are working on configuration: using our own language for describing things. This is good (powerful, with maths and string operations) but it does add more complexity to the system. All those cross-references can hide problems, so you really have to use preflight checking to look at things. At the same time, I don't want to do lots of maintenance of the files, keeping up to date with the XML stuff, so we have the ability to read in a Hadoop XML file, have its values turned into SmartFrog attributes, and then let other components reference them. I'm still exploring the best way to work with Hadoop configurations. One real problem is staying in sync with remote servers: I'd really like every remote node to be able to serve up, on request, the configuration file it used, no matter how that remote node was brought up.
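Reading a Hadoop-format XML file into plain (name, value) attributes, as described above, is mechanical. A minimal sketch, using only JDK XML parsing (HadoopXmlReader is a hypothetical helper, not the SmartFrog component itself; the <configuration>/<property>/<name>/<value> layout is Hadoop's real file format):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/**
 * Sketch: read a Hadoop-format configuration XML document
 * (<configuration><property><name>..</name><value>..</value>...)
 * into a plain map, the kind of thing a CM tool can then
 * cross-reference. Hypothetical helper class.
 */
public class HadoopXmlReader {
    public static Map<String, String> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(
                        xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> conf = new HashMap<>();
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element p = (Element) props.item(i);
            String name = text(p, "name");
            if (name != null) {
                conf.put(name, text(p, "value"));
            }
        }
        return conf;
    }

    /** Text content of the first child element with the given tag, or null. */
    private static String text(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
    }
}
```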
This is the deployed data node
Alongside the actual cluster deployment is the stuff I have initially put together for testing; the ability to work with the filesystem, to copy data in and out and to submit work -blocking or non-blocking.
This is where I am right now. I make progress, I get beaten back. I can bring up clusters in different configurations, check they are healthy, and run tests. My basic MR runs are working, but some bits of Java security are interfering with some of the JSP pages that I have added extra tests to.
This is something I'd like to deal with later: change how we configure Hadoop. It is getting complex. More importantly, the tactic of requiring end users to have the configuration only works when the end users are in sync with the cluster. As different options for configuring Hadoop emerge, can we come up with a proper interface for doing so?
People ask about performance under VMs. Here is the answer: yes, it is slower, especially for virtual disks that go through an extra layer of indirection before going physical. But the people we are offering private Hadoop clusters to are more grateful to have a cluster at all than for its absolute performance.
This is one that is keeping me somewhat busy right now, especially on VMware. Everything in the cluster needs to know the hostname in the fs.default.name URL, or it does not bind properly. You don't know that hostname until the namenode knows who it is; in dynamic clusters, that isn't predictable. In our networks we've used Anubis, an HA tuple-space, to find these things out; we can do that here, but it doesn't boot up in networks without multicast. Machines in virtual networks don't always have reverse DNS or good hostname setups. They do not always know their own name.
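A preflight check along these lines could compare the host in the fs.default.name URL with what the machine thinks its own name is. HostnameCheck is a hypothetical sketch using only java.net lookups, with no Hadoop dependency; a real check would also want to verify reverse DNS.

```java
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;

/**
 * Preflight sketch: does this machine appear to be the host named
 * in the fs.default.name URL? Hypothetical helper class.
 */
public class HostnameCheck {

    /**
     * Extract the host from an fs.default.name style URL,
     * e.g. "hdfs://namenode:8020" yields "namenode".
     */
    public static String namenodeHost(String fsDefaultName) {
        String host = URI.create(fsDefaultName).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in " + fsDefaultName);
        }
        return host;
    }

    /**
     * True if the namenode host resolves to this machine's own
     * address, or to a loopback address.
     */
    public static boolean isLocalNamenode(String fsDefaultName)
            throws UnknownHostException {
        InetAddress namenode =
                InetAddress.getByName(namenodeHost(fsDefaultName));
        InetAddress local = InetAddress.getLocalHost();
        return namenode.isLoopbackAddress() || namenode.equals(local);
    }
}
```

Run at startup, a check like this turns the silent "does not bind properly" failure into an immediate, diagnosable error.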
Another option to consider: keep all the conf files under SCM, and push them out by having every node either mount the same bit of disk, or run a startup script that checks out the values from the chosen SCM.
Pros:
+ all under SCM: diffs, rollbacks, etc.
+ integrates with staging and release process
+ if you use ClearCase, wonderful things are possible
+ low cost, low tooling requirements
Cons:
- still need to trigger a service restart and reload of settings
- ClearCase is a source of much stress and should be avoided by anyone who doesn't consider animal and human sacrifice part of the day-to-day operations process.
One option, especially in datacentres like Amazon's, is to use a key-value store of some form to do the name-to-value mapping; on AWS you could use SimpleDB as the store, or as a pointer to which s3: file contains the settings (the latter would be good for rollback). This is a variant of the LDAP design; you need an HA database. That is OK if your application needs this database for other things (as in the Java EE architecture), but in Hadoop you've just added more complexity and points of failure. Unless you can be sure that the database doesn't go down, it is another risk.
I don't yet know of anyone who manages Hadoop via LDAP. How would you do it? Push the site settings out as (name, value) pairs, and give machines the option of overriding them per-machine or per-job by pointing them at machine-specific parts of the tree.
Pros:
+ integrates with the rest of the infrastructure
+ LDAP has lots of client libraries for cross-language work
+ high availability
+ the GUI lets you see what the machines are seeing
+ the GUI lets you experiment with configuration
+ no need to create new release artifacts to tune the system (ops like this)
Cons:
- rollback?
- not under SCM; can't keep the config in sync with the binaries
- you don't want to keep up to date with core-default.xml by hand
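Client-side, the mechanics are straightforward with JNDI, which ships in the JDK. A sketch of turning an LDAP entry's attributes into configuration pairs; the server URL, DN and attribute layout here are hypothetical, and multi-valued attributes keep only their first value.

```java
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.NamingException;
import javax.naming.directory.Attribute;
import javax.naming.directory.Attributes;
import javax.naming.directory.InitialDirContext;

/**
 * Sketch: pull (name, value) configuration pairs from an LDAP entry.
 * Hypothetical helper class; the DN and attribute layout are assumptions.
 */
public class LdapConf {

    /** Fetch one entry's attributes from a live LDAP server. */
    public static Map<String, String> fetch(String url, String dn)
            throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY,
                "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, url);
        InitialDirContext ctx = new InitialDirContext(env);
        try {
            return toMap(ctx.getAttributes(dn));
        } finally {
            ctx.close();
        }
    }

    /** Convert an LDAP entry's attributes into (name, value) pairs. */
    public static Map<String, String> toMap(Attributes attrs)
            throws NamingException {
        Map<String, String> conf = new HashMap<>();
        NamingEnumeration<? extends Attribute> all = attrs.getAll();
        while (all.hasMore()) {
            Attribute a = all.next();
            conf.put(a.getID(), String.valueOf(a.get()));
        }
        return conf;
    }
}
```

Per-machine or per-job overrides would then be a second toMap() call against a more specific DN, merged over the site-wide map.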