Dynamic Hadoop Clusters


Published on

How to bring up clusters of Hadoop systems using SmartFrog. Presented at ApacheCon EU 2009

Published in: Technology
  • Be the first to comment

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Some people say "when your data is in the cloud, you don't need to know where it is". We say, if you have the pager that goes off when the data isn't reachable, you do -or you need to know the phone number of someone else who does.
  • This is what we are talking about: Hadoop on a cluster. Most of the stuff is disposable hardware, but not your namenode. The job tracker is also something you care about. The rest, task trackers can come and go (assuming they have no side effects), datanodes can only be disposed of if the data is replicated. The rate of failure will matter a lot.
  • Name nodes are a point of failure. Look after them. Data nodes have to be viewed as transient. They come down, they come back up. If they aren't stable enough, you have problems as you wont ever get work done. Some organisations track hard disks. They matter more than CPUs, when you think about it. Keep track of every disk's history, its batch number, the #of power cycles of every machine it is in, etc, etc. etc. Run the log s through Hadoop later to look for correlations. Trusted machines: until Hadoop has real security, you have to trust everything on the net that can talk to the machines. Either bring them all up on a VPN (performance!) or a separate physical network
  • We regularly see this discussed on the list -the problem of keeping /etc/hosts synchronised, pushing out site stuff. Don’t do it by hand. Do it once to know what the files are, then stop and look for a better way.
  • This is how Yahoo! appear to run the largest Hadoop clusters in production. There's a good set of slides on this by Allen Wittenauer from ApacheCon 2008. LDAP would seem to add a SPOF, but you can have HA directory services. Also, if DNS is integrated with LDAP, the loss of DNS/LDAP is equally drastic, so availability is not made any worse than not using LDAP. You also need your own RPMs, as you will need to push out all binaries in RPM format. This often scares Java developers, but isn't that hard once you start on it. Hard to test though. Now, in this world, where do the hadoop XML files come from?
  • This is my belief as to how Hadoop 0.21 gets configured. Belief. There may be some surprises hadoop-core.jar has XML files for hadoop's own configuration, properties files for Log4J and metrics core-default has the default values for hadoop's inner architecture; hdfs-default for the filesystem and mapred-default for JobTracker and TaskTracker. This is a precursor to splitting the projects in SVN, and the JARs, and replaces hadoop-default.xml. do not edit these files unless you know what you are doing and have a need to, and an understanding that your changes may be lost The -site files in conf/ provide overrides of these XML files -for all values not marked as final. hadoop-policy is for group access policy; there is no -default in the JAR. You only need this if you have access control turned on. There is a Log4J.properties file. To change it, you would need to rebuild Hadoop. As you can point to a different resource in hadoop-env.sh, this should not be needed. The hadoop-metrics file controls what metrics are collected. I'm not aware that it can be changed without a rebuild You may want a conf/slaves file to list allowed members of the network. This helps to secure an EC2 cluster, but stops you adding new nodes on demand There is the implicit configuration of the local machine, its classpath, the network, etc. These lead to support calls. Hadoop likes a stable network with DNS and reverse-DNS working, and machines that know their external IP address. When a job is submitted, the client-side configuration is bundled up and pushed out with the submission. The client-side submission must contain enough information to connect to the job tracker and filesystem. In the XML files, there are some ${} expressions evaluating JVM properties. They get interpreted in whichever machine reads the values from the Configuration object. As it moves round the network, the values may change. This may or may not be what you want.
  • Cloudera are pushing out binaries and config as RPMs, with a web UI for creating new RPMs from your config -some preflight validation taking place at this point. This lets you push out new configurations via your RPM-management tooling. I'm not over-happy with relying on remotely generated RPMs for your config, because it complicates things. Its easier than using <rpmbuild> and rolling your own -I know that- but I like to have the config files as text under SCM; this removes that option. What it does do is make it easy to bring up an 18.x release of Hadoop. Ultimately Hadoop will end up having its own RPMs, or debs, as this product does only get installed on mass systems. Hopefully whatever RPMs people out together are compatible with this.
  • This is a screenshot of cloudera RPMS on CentOS 5.2. The VM is confused about its hostname, a fact which is going to break Hadoop, but what is key is that the RPM installed itself and you can run namenodes, datanodes, trackers as services. Other problem: the RPMs hard code a Java dependency in there. This is tricky. I avoid that in my work RPMs as you have to choose how to install java, which is pretty tricky in RPMs. These RPMs demand the sun one, not openJDK or anyone elses
  • This is the Cloudera strategy. Tools to create RPM Pro + Ant, maven, have (some) RPM build support. + SCM managed + quick rollback via alternatives . + no Cons -still need to trigger a service restart and reload of settings -ClearCase is a source of much stress and should be avoided by anyone who doesn't consider animal and human sacrifice part of the day-to-day operations process. - We'll have to watch what happens here; it really depends on the tooling to go around the RPM creation and serving up to the machines. A lot of the problems related to pushing out updates over a big cluster go away with virtualisation, when a single shared image is used by all the workers.
  • If you are going to look after a set of machines with alternatives, learn to use tools like clusterssh. This will issue the same set of ssh commands to a set of machines, even a set listed in a text file. The big problem: the same command gets sent to every box. You had better have them all in sync, have scripts that look at the state of the machine before doing things.
  • This is how you should view a CM-managed cluster, including a virtual one. In a "cloud" infrastructure the services in pink are used to work with machines
  • CM tools -of any kind- are the only sensible way to manage big clusters. Even then, there are issues about which design is best, here are some of the different options. What I'm going to talk about is decentralized and mostly state driven, though it has workflows too State Driven : observe system state, push it back into the desired state Workflow : apply a sequence of operations to change a machine's state Centralized : central DB in charge Decentralized : machines look after themselves
  • SmartFrog is a language for describing system configurations, a non-XML language with basic mathematics, cross references -including late-binding LAZY references to values not available until a system is deployed.
  • Here is that model turned into an instantiation across machines and multiple JVMs ( forbetter isolation of the Hadoop processes and static variables) and
  • Here is a demo
  • This is what the current proposed lifecycle for hadoop services: namenode, datanodes, trackers, etc. Things you start and stop and wait for starting. You create something, start it, it then chooses to jitter between started and live unless it fails Service extends Closeable(), so Closeable.close() shuts it down. When a service fails, the underlying exception is retained for the benefit of management tools
  • This is part of the base class. There's a lot more to it just to make it easy to subclass and move from state to state
  • Its not enough to bring up a service, you want to see if it is alive. This is the current draft of a ping operation, one that adds health to the services. Services can add any number of exceptions to the status, so if something fails, you can see what's going on. This isn't ideal (exception construction costs high due to stack traces); I think we need to work on this via some stuff on a ThrowableWriteable which can be constructed from an exception, or just a stack trace sent over the ire
  • This is where ping() gets complex. How do you define healthy in a cluster where the overall system availability is not 'all nodes are up' but 'we can get our work done', and that the failure of anything should be transient. I propose: push it up to the caller, report problems but don't declare failure in-node just because something remote is missing. That allows policy tools to handle it on a site-specific basis. For my clusters, if the namenode isn't up in 5 minutes, problems, kill that VM and start again. For Yahoo!, timeout is 45min and the action involves pagers.
  • this is how we are working on configuration; using our own language for describing things. this is good: powerful, has maths and string operations, but does add more complexity to the system. All those cross references can hide problems, so you really have to use preflight checking to look at things. But at the same time I don't want to do lots of maintenance of the files, keeping up to date with the XML stuff. so we have the ability to read in a hadoop XML file, have its values turned into SmartFrog attributes, and then for other components to reference them. I'm still exploring what is the best way to work with hadoop configurations. One real problem is staying in sync with remote servers. I'd really like every remote node to be able to serve in request up the configuration file they used -no matter how that remote node was brought up.
  • This is the deployed data node
  • Alongside the actual cluster deployment is the stuff I have initially put together for testing; the ability to work with the filesystem, to copy data in and out and to submit work -blocking or non-blocking.
  • This is where I am right now. I make progress, I get beaten back. I can bring up clusters in different configurations, check they are healthy, run tests. My basic MR runs are working, but some bits of Java security are interfereing with some of the JSP pages that I have added extra tests to.
  • This is something I'd like to deal with, later: change how we configure Hadoop. It is getting complex. More importantly -the tactic of requiring end users to have the configuration only works when the end users are in sync with the cluster -As different options for configuring Hadoop emerge, can we come up with a proper interface for doing so?
  • People ask about performance under VM tools. Here is the answer. Yes it is slower, especially to virtual disks that go through an extra layer of indirection before going physical. But for those people we are offering private hadoop clusters for, they are more grateful to have a cluster than its absolute performance.
  • This is one that is keeping me somewhat busy right now, especially on VMWare Everything in the cluster needs to know the hostname in the fs.default.name URL, else they do not bind properly You don't know it, until the namenode knows who it is. In dynamic clusters, that isn't predictable In our networks, we've used anubis, an HA tuple-space, to find these things out; we can do that here, but it doesn't boot up in networks without multicast. Machines in virtual networks don't always have rDNS or good hostname setups. They do not always know their own name.
  • Another option to consider: keeping all the conf files under SCM, and pushing them out via every node either mounting the same bit of disk, or by having a startup script to check out the values from a chosen SCM. Pro + all under SCM: diffs, rollbacks, &c + integrate with staging and release process + if you use ClearCase, wonderful things are possible + low cost, low tooling requirements Cons -still need to trigger a service restart and reload of settings -ClearCase is a source of much stress and should be avoided by anyone who doesn't consider animal and human sacrifice part of the day-to-day operations process. -
  • One option -especially in datacentres like Amazon's- is to use keystore of some form to do the name->value mapping; for AWS you could use simpleDB as the store, or as a pointer to which s3: file contains the settings -the latter would be good for rollback This is a variant of the LDAP design; you need an HA database. It is OK if your app needs this database for other things (i.e. in the Java EE architecture), but in Hadoop you've just added more complexity and points of failure. Unless you can be sure that the database doesn't go down, it is another risk.
  • I don't yet know of anyone who managed Hadoop via LDAP. How would you do it? -push the site settings out as (name,value) pairs. Have machines have the option of overriding it for per-machine, per-job by pointing them at machine-specific bits of the tree. Pros + integrates with rest of infrastructure + LDAP has lots of client libraries for cross-language work + High Availability + GUI lets you see what the machines are seeing + GUI lets you experiment with configuration, + No need to create new release artifacts to tune the system (ops like this)‏ Cons - rollback? - not under SCM; can't keep conf in sync with binaries. - you don't want to keep up to date with core-default.xml by hand.
  • Pull this, replace with more live code
  • Dynamic Hadoop Clusters

    1. Dynamic Hadoop Clusters Steve Loughran Julio Guijarro http://wiki.smartfrog.org/wiki/display/sf/Dynamic+Hadoop+Clusters
    2. 27 April 2009
    3. Hadoop on a cluster 27 April 2009 1 namenode, 1+ Job Tracker, many data nodes and task trackers
    4. Cluster design 27 April 2009 <ul><li>Name node: high end box, RAID + backups. -this is the SPOF. Nurture it </li></ul><ul><li>Secondary name node —as name node </li></ul><ul><li>Data nodes: mid-range multicore blade systems No RAID. Maybe: 2 disks/core </li></ul><ul><li>Job Tracker: standalone server </li></ul><ul><li>Task Trackers: on/near the data nodes </li></ul><ul><li>Secure the LAN </li></ul><ul><li>Everything will fail -learn to read the logs </li></ul>
    5. Management problems big applications <ul><li>Configuration </li></ul><ul><li>Lifecycle </li></ul><ul><li>Troubleshooting </li></ul>27 April 2009
    6. The hand-managed cluster 27 April 2009 <ul><li>Manual install onto machines </li></ul><ul><li>SCP/FTP in Hadoop tar file </li></ul><ul><li>Edit the -site.xml and log4j files </li></ul><ul><li>edit /etc/hosts, /etc/rc5.d, ssh keys … </li></ul><ul><li>Installation scales O(N) ‏ </li></ul><ul><li>Maintenance, debugging scales worse </li></ul><ul><li>Do not try this more than once </li></ul>
    7. The locked-down cluster 27 April 2009 <ul><li>PXE/gPXE Preboot of OS images </li></ul><ul><li>RedHat Kickstart to serve up (see instalinux.com) ‏ </li></ul><ul><li>Maybe: LDAP to manage state </li></ul><ul><li>Chukwa for log capture/analysis? </li></ul><ul><li>Uniform images, central LDAP service, good ops team, stable configurations, home-rolled RPMs </li></ul><ul><li>How Yahoo! work? </li></ul>
    8. How do you configure Hadoop 0.21? 27 April 2009
    9. 27 April 2009 <ul><li>RPM-packaged Hadoop distributions </li></ul><ul><li>Web UI creates configuration RPMs </li></ul><ul><li>Configurations managed with &quot;alternatives&quot; </li></ul>cloudera.com/hadoop
    10. 27 April 2009
    11. Configuration in RPMs 27 April 2009 +Push out, rollback via kickstart. - Extra build step, may need kickstart server
    12. clusterssh: cssh 27 April 2009 If all the machines start in the same initial state, they should end up in the same exit state
    13. CM-Managed Hadoop 27 April 2009 Resource Manager keeps cluster live; talks to infrastructure Persistent data store for input data and results
    14. Configuration Management tools 27 April 2009 CM tools are how to manage big clusters State Driven Workflow Centralized Radia, ITIL, lcfg Puppet Decentralized bcfg2, SmartFrog Perl scripts, makefiles
    15. SmartFrog - HPLabs' CM tool <ul><li>Language for describing systems to deploy —everything from datacentres to test cases </li></ul><ul><li>Runtime to create components from the model </li></ul><ul><li>Components have a lifecycle </li></ul><ul><li>Apache 2.0 Licensed from June/July 2009 </li></ul><ul><li>http://smartfrog.org/ </li></ul>27 April 2009
    16. Model the system in the SmartFrog language TwoNodeHDFS extends OneNodeHDFS { localDataDir2 extends TempDirWithCleanup { } datanode2 extends datanode { dataDirectories [LAZY localDataDir2]; dfs.datanode.https.address &quot;;; } } 27 April 2009 Inheritance, cross-referencing, templating extending an existing template a temporary directory component extend and override with new values, including a reference to the temporary directory
    17. The runtime deploys the model 27 April 2009 Add better diagram
    18. DEMO 27 April 2009
    19. HADOOP-3628: A lifecycle for services 27 April 2009
    20. Base Service class for all nodes public class Service extends Configured implements Closeable { public void start() throws IOException; public void innerPing(ServiceStatus status) throws IOException; void close() throws IOException; State getLifecycleState(); public enum State { UNDEFINED, CREATED, STARTED, LIVE, FAILED, CLOSED } } 27 April 2009
    21. Subclasses implement transitions public class NameNode extends Service implements ClientProtocol, NamenodeProtocol, ... { protected void innerStart() throws IOException { initialize(bindAddress, getConf()); setServiceState(ServiceState.LIVE); } public void innerClose() throws IOException { if (server != null) { server.stop(); server = null; } ... } } 27 April 2009
    22. Health and Liveness: ping() ‏ public class DataNode extends Service { ... public void innerPing(ServiceStatus status) throws IOException { if (ipcServer == null) { status.addThrowable( new LivenessException(&quot;No IPC Server running&quot;)); } if (dnRegistration == null) { status.addThrowable( new LivenessException(&quot;Not bound to a namenode&quot;)); } } 27 April 2009
    23. Liveness issues <ul><li>If a datanode cannot see a namenode, is it still healthy? </li></ul><ul><li>If a namenode has no data nodes, is it healthy? </li></ul><ul><li>How to treat a failure of a ping? Permanent failure of service, or a transient outage? </li></ul>27 April 2009 How unavailable should the nodes be before a cluster is &quot;unhealthy&quot; ?
    24. Replace hadoop-*.xml with .sf files NameNode extends FileSystemNode { nameDirectories TBD; dataDirectories TBD; logDir TBD; dfs.http.address &quot;;; dfs.namenode.handler.count 10; dfs.namenode.decommission.interval (5 * 60); dfs.name.dir TBD; dfs.permissions.supergroup &quot;supergroup&quot;; dfs.upgrade.permission &quot;0777&quot; dfs.replication 3; dfs.replication.interval 3; . . . } 27 April 2009
    25. Hadoop Cluster under SmartFrog 27 April 2009
    26. Aggregated logs 17:39:08 [JobTracker] INFO mapred.ExtJobTracker : State change: JobTracker is now LIVE 17:39:08 [JobTracker] INFO mapred.JobTracker : Restoration complete 17:39:08 [JobTracker] INFO mapred.JobTracker : Starting interTrackerServer 17:39:08 [IPC Server Responder] INFO ipc.Server : IPC Server Responder: starting 17:39:08 [IPC Server listener on 8012] INFO ipc.Server : IPC Server listener on 8012: starting 17:39:08 [JobTracker] INFO mapred.JobTracker : Starting RUNNING 17:39:08 [Map-events fetcher for all reduce tasks on tracker_localhost:localhost/] INFO mapred.TaskTracker : Starting thread: Map-events fetcher for all reduce tasks on tracker_localhost:localhost/ 17:39:08:960 GMT [INFO ][TaskTracker] HOST localhost:rootProcess:cluster - TaskTracker deployment complete: service is: tracker_localhost:localhost/ instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060 17:39:08 [TaskTracker] INFO mapred.ExtTaskTracker : Task Tracker Service is being offered: tracker_localhost:localhost/ instance org.apache.hadoop.mapred.ExtTaskTracker@8775b3a in state STARTED; web port=50060 17:39:09 [IPC Server handler 5 on 8012] INFO net.NetworkTopology : Adding a new node: /default-rack/localhost 17:39:09 [TaskTracker] INFO mapred.ExtTaskTracker : State change: TaskTracker is now LIVE 27 April 2009
    27. File and Job operations DFS manipulation: DfsCreateDir, DfsDeleteDir, DfsListDir, DfsPathExists, DfsFormatFileSystem, DFS I/O: DfsCopyFileIn, DfsCopyFileOut 27 April 2009 TestJob extends BlockingJobSubmitter { name &quot;test-job&quot;; cluster LAZY PARENT:cluster; jobTracker LAZY PARENT:cluster; mapred.child.java.opts &quot;-Xmx512m&quot;; mapred.tasktracker.map.tasks.maximum 5; mapred.tasktracker.reduce.tasks.maximum 1; mapred.map.max.attempts 1; mapred.reduce.max.attempts 1; }
    28. What does this let us do? <ul><li>Set up and tear down Hadoop clusters </li></ul><ul><li>Manipulate the filesystem </li></ul><ul><li>Get a console view of the whole system </li></ul><ul><li>Allow different cluster configurations </li></ul><ul><li>Automate failover policies </li></ul>27 April 2009
    29. Status as of June 2009 <ul><li>SmartFrog code in sourceforge SVN </li></ul><ul><li>HADOOP-3628-2 branch on svn.apache.org </li></ul><ul><li>Building RPMs for managing local clusters </li></ul><ul><li>Hosting on VMs </li></ul><ul><li>Submitting simple jobs </li></ul><ul><li>Troublespots: Java security, JSP </li></ul>27 April 2009
    30. Issue: Hadoop configuration <ul><li>Trouble: core-site.xml, mapred-site … </li></ul><ul><li>Current SmartFrog support subclasses JobConf </li></ul><ul><li>Better to have multiple sources of configuration </li></ul><ul><ul><li>XML </li></ul></ul><ul><ul><li>LDAP </li></ul></ul><ul><ul><li>Databases </li></ul></ul><ul><ul><li>SmartFrog </li></ul></ul>27 April 2009
    31. Issue: VM performance <ul><li>CPU performance under Xen, VMWare slightly slower </li></ul><ul><li>Disk IO measurably worse than physical </li></ul><ul><li>Startup costs if persistent data kept elsewhere </li></ul><ul><li>VM APIs need to include source data/locality </li></ul><ul><li>Swapping and clock drift causes trouble </li></ul>27 April 2009 Cluster availability is often more important than absolute performance
    32. Issue: binding on a dynamic network <ul><li>Discovery on networks without multicast </li></ul><ul><li>Hadoop on networks without reverse DNS </li></ul><ul><li>Need IP address only (no forward DNS)‏ </li></ul><ul><li>What if nodes change during a cluster's life? </li></ul>27 April 2009
    33. Call to action <ul><li>Dynamic Hadoop clusters are a good way to explore Hadoop </li></ul><ul><li>Come and play with the SmartFrog Hadoop tools </li></ul><ul><li>Get involved with managing Hadoop </li></ul><ul><li>Help with lifecycle, configuration issues </li></ul>21 June 2009
    34. 27 April 2009
    35. XML in SCM-managed filesystem 27 April 2009 +push out, rollback. - Need to restart cluster, SPOF?
    36. Configuration-in-database <ul><li>JDBC, CouchDB, SimpleDB, … </li></ul><ul><li>Other name-value keystore? </li></ul><ul><li>Bootstrap problem -startup parameters? </li></ul><ul><li>Rollback and versioning? </li></ul>27 April 2009
    37. Configuration with LDAP 27 April 2009 +View, change settings; High Availability - Not under SCM; rollback and preflight hard
    38. Getting Hadoop to deploy with SmartFrog <ul><li>Configure Hadoop from an SmartFrog description </li></ul><ul><li>Write components for the Hadoop nodes </li></ul><ul><li>Write the functional tests </li></ul><ul><li>Add workflow components to work with the filesystem; submit jobs </li></ul><ul><li>Get the tests to pass </li></ul>27 April 2009