Whither cman



Whatever happened to cman?

Christine “Chrissie” Caulfield, Red Hat
ccaulfie@redhat.com

Version history
    0.1  30th January 2009   First draft
    0.2  3rd February 2009   Add a chapter on migrating from libcman
    0.2  27th February 2009  Add some commas and fix some typos

1. Introduction

CMAN was the mainstay of Red Hat's cluster suite in Red Hat Enterprise Linux 4, 5 and 6. Actually CMAN in RHEL4 was a totally different thing to that found in RHEL5 and 6 but the external interfaces looked very similar.

So what was CMAN?

CMAN stood for “Cluster MANager” (boring eh?) and provided the low-level infrastructure for the cluster; node communications, votes, quorum etc. Most of the useful bits of the cluster stack were built on top of cman so few people got to see it unless things went horribly wrong.

Starting with RHEL5 we dispensed with the custom-written kernel CMAN and based the cluster manager on OpenAIS. For more details about this see my document on the subject: http://people.redhat.com/ccaulfie/docs/aiscman.pdf.

OK, a short summary then. CMAN in RHEL 5/6 is really just a quorum module that loads into openais (these days known as corosync). It provides other API services such as messaging but those are really just wrappers around other openais services.

At the cluster summit in 2008 it was decided that keeping companies' own clustering modules was a waste of effort and almost always fails to build a community. Even though CMAN (all versions of it) were released as free software, they lived in their own repositories and were seldom modified by the original author (ie me!). So efforts were made to make the software even more open and included in the components of their parent projects. Also part of the collaboration efforts were the pooling of resource and fence agents and several others. But this is just about CMAN OK? Come on now, concentrate.

2. corosync/openais – what happened there?

In Red Hat Enterprise Linux 5 cman was a module for openais. In RHEL6 (we're not actually supposed to call it RHEL, but “Red Hat Enterprise Linux” is a lot of typing. And even if I do a search/replace afterwards it'll be unwieldy to read ... sorry marketing, but this is an engineering paper, not a sales one) it's a module for corosync.

So, what happened there?

Just in case you weren't already confused enough, openais has now been split into two parts. The parts still called openais are the modules for corosync that implement the SAF AIS services and APIs – you know: the ones with the VeryAnnoyinglyCapitalisedNames. Corosync is the communications layer below that, but it also includes other services such as cpg (Closed Process Groups ... very popular amongst service writers that one) and, now, quorum. Most people won't really notice this split, apart from, possibly, the name of the process in ps output.

Just to put a little bit of detail on this, the services that are included with corosync are:

    cfg cpg evs quorum

Services included in openais are:

    Amf Ckpt Clm Evt Lck Msg

(see what I mean about the capitalisation ... and that's just the names!)

corosync is the name of the daemon, replacing aisexec.

I hope that's all clear because it gets more complicated from here on in.

3. Intermediate status – RHEL6

Before getting onto the really new code, I'll talk a little about the RHEL6 code which is sort of a half-way house between RHEL5 and the code that will be in RHEL7. I suppose that makes some sort of sense, numerically if nothing else.

The first thing you'll notice about RHEL6 is that ccsd is not there any more. Actually I very much doubt it'll be really the very first thing you notice unless you're incredibly sad or a real clustering nerd, but I do hope you might have spotted it at some stage. I'll talk (well, write – unless you want to invite me, all expenses paid, to give a presentation) about that in the next chapter.

Apart from this, you will notice very little difference in the RHEL6 cman. We were under pressure to allow rolling upgrades from RHEL5 (ie no wire protocol changes) and time constraints meant that the code discussed below was not stable enough to be included for the deadline. The only other substantial difference is that cman now is a quorum provider for corosync. I'll also write more about this later on (you're really agog now aren't you?), but basically it means that the rest of the corosync modules will know about the quorum status and can act on it if they feel the urge, and that applications that are only interested in the quorum status and node list can link straight into libquorum and bypass cman altogether. This might help some people with the upgrade path ... or not ... we'll see.

4. Configuring corosync

This chapter is also relevant to RHEL6 by the way, just in case you were going to skip the rest of the document and have an early night.

So, as I mentioned above, ccsd is now gone and I can hear the cheers already. The distribution of cluster.conf files is now delegated to ricci and a new command ccs_sync. I will not focus too closely on that bit any more but it brings me to the next main change in the way CMAN (as it still is at this stage) is loaded. Actually you don't have to use ccs_sync to distribute cluster config files, conga, scp, nfs and several other methods work perfectly well. Pick the one that suits you best; “choice is good” as the free-marketeers love to say.

The way that corosync (not aisexec remember?) is launched is still via cman_tool as before. But it now sets a few environment variables first. The most important of these tells corosync which configuration modules to use, and there are up to three of those now.

The first, xmlconfig, simply loads the contents of /etc/cluster/cluster.conf into the corosync object database, where it keeps all the configuration information. The second, cman-preconfig, interprets that format into a format that is native to corosync by resolving nodenames, generating multicast addresses etc.

4.1. Those modules in full

Corosync config modules are different from the normal service modules that corosync uses. They are specified in the environment variable COROSYNC_DEFAULT_CONFIG_IFACE before corosync is started. This variable can contain a colon-separated list of modules that will be called before corosync is started for real. They also manage the reloading of data when the underlying configuration file or device changes. By default cman_tool starts corosync with

    COROSYNC_DEFAULT_CONFIG_IFACE=xmlconfig:cmanpreconfig:openaisserviceenable

4.2. xmlconfig & friends

xmlconfig is simply an XML parser with a default filename of /etc/cluster/cluster.conf. It reads the tree of objects it finds in that file and loads it, unmodified, into corosync's object database (objdb). There are a few other config modules that can be used here. There is an ldap one (called, rather obviously I hope, ldapconfig) and even a ccsd one too if you're hopelessly addicted to the past. These modules are quite easy to write and if you have a different back-end database you want to use it's probably quite simple to write a corosync config module for it.

Just in case you were wondering, objdb is simply an in-memory hierarchical database where corosync holds configuration data and some state.

The point about xmlconfig and its similarly-named friends (actually, maybe they're relatives ... hmm) is that they do not try to interpret the data, they simply copy what is there into objdb. If it makes no sense to corosync then it will tell you when it tries to start up properly shortly afterwards. You don't need cman-preconfig (see below) to use these if the format of the data matches the hierarchy of corosync.conf(5).

cman_tool join allows you to specify a different main config module than xmlconfig on the command-line with cman_tool join -C <module>. It will always add cman-preconfig though.
eg the command:

    # cman_tool join -C ldapconfig

will load corosync with the environment variable:

    COROSYNC_DEFAULT_CONFIG_IFACE=ldapconfig:cmanpreconfig:openaisserviceenable

4.3 cman-preconfig

Cman-preconfig is a rather strange beast in some ways. Its main purpose is to re-interpret the loaded configuration file. Let me endeavour to explain.

CMAN since RHEL4 has a very specific XML-based format which has not changed significantly in the meantime. It contains details of all the nodes in the cluster, along with their fence details and also rgmanager services too. This format bears no resemblance to corosync.conf. If you load a normal RHEL cluster.conf into corosync using xmlconfig alone, without passing it through cman-preconfig then corosync will not start. It will not be able to find a single entry that it needs to run.

cman-preconfig re-interprets this configuration data into a form that corosync expects, and adds some sensible, we hope, defaults for things not specified at all, eg the cluster multicast address. It also has a mode where it can start up cman/corosync with no configuration data at all (see cman_tool -X) which is very useful for time-poor or lazy people who just want to test the cluster a little, or perhaps for commissioning virtualised clusters. There's a whole chapter on this later on.

4.4. openaisserviceenable

This module simply tells corosync to also load the openais services as well as its own default set. Normally this is what you will want as many cluster daemons use openais services like checkpointing. You can disable this by adding the -A switch to cman_tool join.

5. CMAN itself

Now we get to the nitty-gritty. The CMAN module exists in RHEL5 and RHEL6 and does all the cman-y things that cman does such as quorum and messaging. It also services the libcman library using an IPC mechanism all of its own. It does not use corosync's IPC (library to daemon communications) system for reasons that are unlikely to become clear any more.

The new code, as I've said, dispenses with anything called cman at all. The main work of managing quorum is done by a built-in corosync module called votequorum. This code is actually largely lifted from the original cman module, with all the non-quorum bits removed, and converted to use the corosync IPC layer.

This module also includes code for managing quorum devices (eg qdiskd) as well as the hated “Disallowed” mode. Though both are inactive in a default corosync system.

At the time of writing libcman still exists. It provides the same API as the libcman in previous versions though there might be small semantic differences. It's also not ABI compatible so the soname has been incremented to 4. This libcman uses a combination of calls to quorum, confdb and a custom cman module to provide these services. It is provided largely as a compatibility layer for older applications though the source code to libcman can also provide ideas on how to do cman-y type things with the existing corosync services. All the programs in Red Hat clustering will be changed to use the standard APIs that are provided with corosync.

6. The CMAN module

Now hang on there, you just said “a custom cman module”. So cman does still exist?

Well, yes and no.

There is a module called “cmanbits” which provides services for some libcman APIs. These are:

    cman_send_data
    cman_start_recv_data
    cman_end_recv_data
    cman_is_listening

Of these, the messaging API can be replaced with the normal openais messaging system. There is no suitable replacement for is_listening.

Missing totally from this new cman is the barriers API which nobody used, to my knowledge. If you did and are fuming with me for leaving it out ... sorry. But there are actually better ways of achieving the same effect with existing corosync or openais calls.

Some other functions that were in libcman are now provided by the corosync cfg service. These are

    corosync_cfg_kill_node
    corosync_cfg_try_shutdown
    corosync_cfg_replyto_shutdown
    corosync_cfg_get_node_addrs

They are wrapped in the new libcman but are deprecated.

7. corosync quorum

7.1. quorum plugins

A brief word about corosync's quorum system might be helpful here. Corosync has a basic notion of quorum built in to the executive. Services cannot be synchronised without it and several IPC calls will be blocked if the cluster is inquorate.

By default corosync does not load the quorum module. This is for backwards compatibility so that existing clusters will work. If the quorum module was loaded and no quorum provider was specified then the cluster would never be quorate and never get any work done. Not a popular system I would guess. If the quorum module is absent then everything just works all the time.

To load the quorum module, you need to add the following stanza to corosync.conf (or equivalent):

    service {
        name: corosync_quorum
        ver: 0
    }

cman-preconfig does this for you by the way. Corosync will then load the quorum module and allow users to use the basic quorum API as defined in /usr/include/corosync/quorum.h. This simply allows programs to query the quorate state and to receive notifications when it changes. This notification also includes a list of nodes that are part of the quorum. It's useful (I hope) to realise that this plugin is also available in RHEL6 and that cman provides quorum for it. So if you only need to know the quorum state (ie you don't need any of the other libcman gubbins) then calling into libquorum instead will provide you with forward-compatibility.

However, as I mentioned above (or on the previous page if you were ungreen enough to print this document out) this will not give you quorum, all it will do is to block a lot of useful activity in corosync. To make it work you need a quorum provider.

A quorum provider is a corosync module that tells the corosync quorum plugin when the quorate state changes, based on its own internal algorithm. There are many ways of calculating quorum so there are several modules available, and I hope people will write more to suit their own needs. The quorum provider is loaded by putting its name in corosync.conf as follows:

    quorum {
        provider: testquorum
    }

This example, as I sincerely hope you have guessed, loads the testquorum module. Testquorum is an extremely simple module which I recommend as a template to other quorum provider writers. It simply changes the corosync quorate status based on the quorum.quorate variable in the object database. Using corosync-objctl the quorate state can be switched on and off at will. If you have some form of external quorum system this might actually be useful to you.

Another quorum provider shipped with corosync is YKD. This is the YKD system that has been shipped with openais for some time, it has merely been re-designated a quorum plugin now. Sadly the developers did not have time to give it a shiny new logo too so it looks pretty much like it always did. To be quite frank, I know almost nothing about YKD and it scares me a little, so unless a knight in shining armour (ideally looking like David Tennant) comes along to rescue me I'll leave it there.

Now, where was I, sorry I got distracted by the David Tennant reference. Oh yes, votequorum is the quorum module based on cman. This is the one I know most about and is probably of most interest to those of you coming from a cman background. So I'll devote an entire subsection to it. Favouritism? Pah!

7.2. votequorum

Votequorum is the name of the corosync module that replaces most of what cman used to do. Only it does it in a more corosync-y way. It uses the corosync IPC system for communication with users, it provides quorum to the central core of the corosync daemon and provides familiar tracking callbacks.

I'll outline the API calls for votequorum here. If you are familiar with libcman then you can go and get a hot chocolate or something while this bit is going on ... see you in chapter 8.

    cs_error_t votequorum_initialize(votequorum_handle_t *handle, votequorum_callbacks_t *callbacks)
    cs_error_t votequorum_finalize(votequorum_handle_t handle)

These two are the familiar corosync initialise & finalise calls that corosync/openais developers should already be familiar with. libcman users will note that the callbacks are specified here rather than in later API calls. The callbacks are for quorum change or expected votes change – remember votequorum does not have a messaging API.

    cs_error_t votequorum_fd_get(votequorum_handle_t handle, int *fd);
    cs_error_t votequorum_dispatch(votequorum_handle_t handle, cs_dispatch_flags_t dispatch_types)

Again, these should be a familiar sort of call to both corosync and libcman users. You call votequorum_dispatch when the FD returned from votequorum_fd_get becomes active from select or poll.

    cs_error_t votequorum_getinfo(votequorum_handle_t handle, unsigned int nodeid, struct votequorum_info *info)

This call fills in the info structure with information about the quorum subsystem and the node for which you requested information. Members starting node_ are specific to the node in question, the others are for the cluster as a whole.

    cs_error_t votequorum_setexpected(votequorum_handle_t handle, unsigned int expected_votes)

This call changes the expected_votes value for the cluster. This will persist until it is changed again. NOTE: if you are using a configuration file then a forced reload of that file will overwrite a manually set votes value, as will adding a new node to the cluster.

    cs_error_t votequorum_setvotes(votequorum_handle_t handle, unsigned int nodeid, unsigned int votes)

This call changes the number of votes assigned to an existing node in the cluster. This will persist until it is changed again. NOTE: if you are using a configuration file then a forced reload of that file will overwrite a manually set votes value.

    cs_error_t votequorum_qdisk_register(votequorum_handle_t handle, char *name, unsigned int votes)
    cs_error_t votequorum_qdisk_unregister(votequorum_handle_t handle)
    cs_error_t votequorum_qdisk_poll(votequorum_handle_t handle, unsigned int state)
    cs_error_t votequorum_qdisk_getinfo(votequorum_handle_t handle, struct votequorum_qdisk_info *info)

This group of calls manages an optional extra quorum device, often a quorum disk. There can only be one quorum device per cluster and it is up to the quorum device software to ensure that its state is consistent across the cluster or to remove it from quorum. The device can have any number of votes, but those votes cannot be changed without re-registering the device – doing this might cause a temporary loss of quorum. Quorum devices must poll corosync regularly to be regarded as alive, by default if a quorum device has not called votequorum_qdisk_poll() after ten seconds then it will be regarded as dead and removed from quorum. This poll period is customisable, see below.

    cs_error_t votequorum_setstate(votequorum_handle_t handle)

This call sets a non-resettable flag “HasState” in votequorum. This state is available from the getinfo call above and used by some subsystems to prevent cluster merges. It only makes sense to call this if the dreaded “disallowed” flag is set (see configuration variable below).

    cs_error_t votequorum_trackstart(votequorum_handle_t handle, uint64_t context, unsigned int flags)
    cs_error_t votequorum_trackstop(votequorum_handle_t handle)

These are the usual tracking start/stop calls for a corosync plugin. Trackstart enables quorum or expected_votes change messages to be sent to the notification callback. To avoid race conditions it is recommended that the flags CS_TRACKSTART | CS_TRACKCHANGES are used so the program will get one immediate callback containing the current quorum state, then further callbacks as nodes join or leave the cluster.

    cs_error_t votequorum_leaving(votequorum_handle_t handle)

This call sets a flag inside all nodes in the cluster telling them that the node will be leaving the cluster soon. When the node actually does leave the cluster, then quorum will be reduced and activity will carry on. This is like cman_tool leave remove in cman. If this is not called before removing a node then the number of active votes will be reduced and the cluster could lose quorum.

Note that this call does not remove the node from the cluster, it merely sets the flag saying it will. There is a timeout on this flag of ten seconds. If the node is not shut down within this time it will be cancelled. It is important that this timeout is longer than CFG's shutdown timeout otherwise leaving will be cancelled if daemons don't respond.

    cs_error_t votequorum_context_get(votequorum_handle_t handle, void **context)
    cs_error_t votequorum_context_set(votequorum_handle_t handle, void *context)

These two calls simply allow the context variable returned to the callbacks to be changed.

7.2.1 configuring votequorum

The votequorum module has a small number of variables that you can set in objdb. They are all held under the main quorum{} section.

    quorumdev_poll
        This is the time between quorum device polls, in milliseconds. The default is 10000.

    leaving_timeout
        If the node is not shut down <leaving_timeout> milliseconds after votequorum_leaving is called then the leave flag will be cancelled. The default is 10000.

    votes
        The number of votes this node has. The default is 1.

    expected_votes
        The number of votes that this cluster expects. votequorum will use the largest value of all the nodes in the cluster. The default is 1024. ie you will not have quorum unless you change it!

    two_node
        Enables a special two-node mode where a cluster with only two nodes and no quorum device can continue running with only one node. This expects you to have hardware fencing configured. two_node mode can be cleared without a cluster restart (ie you can add a third node) but it cannot be set without restarting all nodes. The default is 0.

    disallowed
        This is an odd mode that prevents two clusters with state (see setstate() above) from merging. If this happens then nodes will be marked “disallowed” and not included in quorum calculations. Those who have read up on RHEL5 cman will know this mode and fear it. It cannot be changed without a cluster restart. The default is 0.

8 Configuration free clusters

Configuring a cluster is hard work. Don't you wish you could just plug in a node and have it “just work”? Well, within several limitations you can. This mode has existed in cman since RHEL4 but is not well documented (oops, maybe that's my fault) and is hardly used to my knowledge.

The key command to remember is

    # cman_tool join -X

On a freshly installed system this will fail with the error:

    corosync died: Could not read cluster configuration

OK, so it's not that configuration-free. You don't get something for nothing in this life. However, don't despair. All you need to do is generate a security key for the cluster and try again:

    # mkdir /etc/cluster
    # corosync-keygen
    # mv /etc/ais/authkey /etc/cluster/cman_authkey

You now need to copy this file to all the nodes you want to join in this cluster. Once you've done that try cman_tool join -X again and it should magically start up. If you start up the other nodes then they will join the same cluster.

cman_tool join -X sets up several defaults to get you going. Maybe the first thing you will notice is that the node numbers are HUGE! They are the IP addresses of the nodes. This isn't a problem, it just looks a little unwieldy, especially on little-endian systems where adjacently numbered nodes might appear to have random and massively different nodeids. eg:

    # cman_tool nodes
    Node Id      Sts   Name
    16951488     M     amy
    33728704     M     anna
    50505920     M     clara
    67283136     M     fanny
    117614784    M     nadia
    134392000    M     sofia

Anyway, if you can cope with those this system should get you going with a simple single cluster that has no fencing requirements (ie GFS cannot be used).

You can override several things on the cman_tool command-line to add things like sensible nodeids and a different cluster name or cluster ID. The last two will allow you to run multiple configuration-less clusters on a single network. By default all nodes will join a cluster called “RHCluster”.

Actually, you can change the nodeids by adding -n<nodeid> to the cman_tool command-line, but that probably doesn't fit in too well with the zero-configuration system I suppose. See the cman_tool man page for a list of options.

Quorum is the big problem with this system. There is no way to know how many nodes are needed in such a cluster. You can set expected_votes on the cman_tool command-line, and I urge you to do so, but without this quorum will default to 1. Yes ONE. Your cluster will always be quorate. I know this goes against the principles of a lot of clustering but there are two main reasons why it is done this way.

1. It's easy to get going. A lot of the time, this mode will be used by people just testing clusters and they want to get going and see what all the fuss is about. Once they understand clusters they can get involved in sorting out quorum, fencing and configuration files.

2. A lot of the time these clusters are run as virtual machines in systems like Xen or KVM. In this case there is no danger of a “split-brain” occurring as the systems are either all on the same host computer or on nodes that are already part of a properly configured cluster with real hardware fencing.

9. Migrating code from libcman

This chapter is just a list of libcman calls showing you where in corosync to find equivalents. Most of them have very similar names so it shouldn't be too hard I hope. The compatibility libcman provides most of these calls so you might like to look at the source code for that library if you are unsure about how to achieve a particular result.

    cman_admin_init
    cman_init
    cman_finish
    cman_setprivdata
    cman_getprivdata
    cman_start_notification
    cman_stop_notification
    cman_start_confchg
    cman_stop_confchg
    cman_get_fd
    cman_dispatch

Use the equivalent calls for the subsystem(s) you are using.

    cman_set_votes
    cman_set_expected_votes
    cman_set_dirty
    cman_register_quorum_device
    cman_unregister_quorum_device
    cman_poll_quorum_device
    cman_get_quorum_device
    cman_get_node_count
    cman_get_nodes
    cman_get_disallowed_nodes
    cman_get_node

Use libvotequorum. Note though that this does not provide you with node names for the cman_get_ calls.

    cman_is_quorate

Use libquorum, or libvotequorum. You should only use libvotequorum if you are also using other features of that module. If all you need to know is the quorum status then you should use libquorum as that will work regardless of quorum provider.

    cman_get_version
    cman_set_version
    cman_get_cluster
    cman_set_debuglog

Use libconfdb or libccs to read/write to the objdb.

    cman_kill_node
    cman_shutdown
    cman_replyto_shutdown
    cman_get_node_addrs

Use libcfg

    cman_leave_cluster

If you just need to leave the cluster then libcfg will work. If you need to leave the cluster and make the remaining nodes keep quorum then you also need to tell votequorum.

    cman_send_data
    cman_start_recv_data
    cman_end_recv_data

Use libcpg

    cman_get_node_extra
    cman_get_subsys_count
    cman_is_active
    cman_is_listening
    cman_barrier_register
    cman_barrier_change
    cman_barrier_wait
    cman_barrier_delete
    cman_get_extra_info
    cman_get_fenceinfo
    cman_node_fenced

These calls are not implemented at all, sorry. You're on your own.
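
The node IDs in the cman_tool nodes listing of chapter 8 can be mapped back to the hosts they came from, which helps when matching cluster log entries to machines. The C program below is an added example, not code from the original paper or from cman: it assumes, as the text describes, that each auto-generated node ID is the node's IPv4 address in network byte order read as a native integer on a little-endian host. The function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Lines copied from the `cman_tool nodes` listing in chapter 8. */
static const char *NODES_OUTPUT[] = {
    "16951488   M  amy",   "33728704   M  anna",
    "50505920   M  clara", "67283136   M  fanny",
    "117614784  M  nadia", "134392000  M  sofia",
};
#define NODE_LINES (sizeof NODES_OUTPUT / sizeof NODES_OUTPUT[0])

/* Parse "<nodeid> <status> <name>"; returns 0 on success, -1 on bad input. */
static int parse_node_line(const char *line, uint32_t *nodeid,
                           char *name, size_t name_len)
{
    unsigned long id;
    char sts[8], nm[64];

    if (sscanf(line, "%lu %7s %63s", &id, sts, nm) != 3)
        return -1;
    if (id > 0xfffffffful || strlen(nm) >= name_len)
        return -1;
    *nodeid = (uint32_t)id;
    strcpy(name, nm);
    return 0;
}

/*
 * Assumption: the ID is the in-memory IPv4 address (network byte order)
 * read as an integer by a little-endian host, so the lowest byte of the
 * integer is the first octet. Decoding byte by byte keeps the result the
 * same whatever machine this program runs on.
 */
static int nodeid_to_ipv4(uint32_t nodeid, char *buf, size_t len)
{
    int n = snprintf(buf, len, "%u.%u.%u.%u",
                     (unsigned)(nodeid & 0xff),
                     (unsigned)((nodeid >> 8) & 0xff),
                     (unsigned)((nodeid >> 16) & 0xff),
                     (unsigned)(nodeid >> 24));
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}

/* The integer a big-endian host would derive from the same address. */
static uint32_t nodeid_on_big_endian(uint32_t nodeid)
{
    return ((nodeid & 0xffu) << 24) | ((nodeid & 0xff00u) << 8) |
           ((nodeid >> 8) & 0xff00u) | (nodeid >> 24);
}

int main(void)
{
    for (size_t i = 0; i < NODE_LINES; i++) {
        uint32_t id;
        char name[64], ip[16];

        if (parse_node_line(NODES_OUTPUT[i], &id, name, sizeof name) != 0 ||
            nodeid_to_ipv4(id, ip, sizeof ip) != 0) {
            fprintf(stderr, "cannot decode: %s\n", NODES_OUTPUT[i]);
            return 1;
        }
        /* Neighbouring hosts differ by 16777216 in the little-endian ID,
         * but only by 1 in the big-endian form. */
        printf("%-6s %10u -> %-15s (big-endian host: %u)\n", name,
               (unsigned)id, ip, (unsigned)nodeid_on_big_endian(id));
    }
    return 0;
}
```

Under that assumption the six nodes decode to consecutive-looking addresses on one subnet, which is why the little-endian IDs jump by multiples of 16777216.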