Linux High Availability
Kris Buytaert

Kris Buytaert @krisbuytaert
- I used to be a Dev, then became an Op
- Senior Linux and Open Source Consultant @inuits.be
- "Infrastructure Architect"
- Building Clouds since before the Cloud
- Surviving the 10th floor test
- Co-author of some books
- Guest editor at some sites
What is HA Clustering?
- One service goes down
  - => others take over its work
- IP address takeover, service takeover, ...
- Not designed for high performance
- Not designed for high throughput (load balancing)

Does it Matter?
- Downtime is expensive
- You miss out on $$$
- Your boss complains
- New users don't return

Lies, Damn Lies, and Statistics
Counting nines (slide by Alan R)
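The availability chart itself is not reproduced in this transcript; as a quick reference, the usual "counting nines" figures work out roughly to (assuming a 365-day year):

  99%      ~3.7 days of downtime per year
  99.9%    ~8.8 hours per year
  99.99%   ~53 minutes per year
  99.999%  ~5.3 minutes per year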
The Rules of HA
- Keep it Simple
- Keep it Simple
- Prepare for Failure
- Complexity is the enemy of reliability
- Test your HA setup

Myths
- Virtualization will solve your HA needs
- Live migration is the solution to all your problems
- VM mirroring is the solution to all your problems
- HA will make your platform more stable
Eliminating the SPOF
- Find out what Will Fail
  - Disks
  - Fans
  - Power (supplies)
- Find out what Can Fail
  - Network
  - Going out of memory

Split Brain
- Communication failures can lead to separated partitions of the cluster
- If those partitions each try to take control of the cluster, it's called a split-brain condition
- If this happens, bad things will happen
  - http://linux-ha.org/BadThingsWillHappen

What do you care about?
- Your data?
  - Consistent
  - Realtime
  - Eventually consistent
- Your connection?
  - Always
  - Most of the time
Shared Storage
- Shared storage
- Filesystem
  - e.g. GFS, GPFS
- Replicated?
- Exported filesystem?
- $$$: 1+1 <> 2
- Storage = SPOF
- Split brain :(
- STONITH

(Shared) Data
- Issues:
  - Who writes?
  - Who reads?
  - What if 2 active applications want to write?
  - What if an active server crashes during writing?
  - Can we accept delays?
  - Can we accept read-only data?
- Hardware requirements
- Filesystem requirements (GFS, GPFS, ...)
DRBD
- Distributed Replicated Block Device
- In the Linux kernel (as of very recently)
- Usually only 1 mount
  - Multi-mount as of 8.X
    - Requires GFS / OCFS2
- Regular FS (ext3, ...)
- Only 1 active application instance accessing the data
- Upon failover the application needs to be started on the other node
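To make that single-primary setup concrete, a minimal DRBD resource definition might look like the sketch below (hostnames, devices and addresses are invented; DRBD 8.x style syntax):

  resource r0 {
    protocol C;                 # fully synchronous replication
    on node-a {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.128.11:7788;
      meta-disk internal;
    }
    on node-b {
      device    /dev/drbd0;
      disk      /dev/sdb1;
      address   10.0.128.12:7788;
      meta-disk internal;
    }
  }

Only one node is promoted to primary and mounts /dev/drbd0; on failover the other node is promoted and the filesystem plus the application are started there.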
DRBD (2)
- What happens when you pull the plug on a physical machine?
  - Minimal timeout
  - Why did the crash happen?
  - Is my data still correct?
Alternatives to DRBD
- GlusterFS looked promising
  - "Friends don't let friends use Gluster"
  - Consistency problems
  - Stability problems
  - Maybe later
- MogileFS
  - Not POSIX
  - The app needs to implement the API
- Ceph
  - ?
HA Projects
- Linux-HA Project
- Red Hat Cluster Suite
- LVS / Keepalived
- Application-specific clustering software
  - e.g. Terracotta, MySQL NDBD
Heartbeat
- Heartbeat v1
  - Max 2 nodes
  - No fine-grained resources
  - Monitoring using "mon"
- Heartbeat v2
  - XML usage was a consulting opportunity
  - Stability issues
  - Forking?
Heartbeat v1
/etc/ha.d/ha.cf
/etc/ha.d/haresources, e.g.:
  mdb-a.menos.asbucenter.dz ntc-restart-mysql mon IPaddr2::10.8.0.13/16/bond0 IPaddr2::10.16.0.13/16/bond0.16 mon
/etc/ha.d/authkeys
Heartbeat v2
"A consulting opportunity" (LMB)

Clone Resources
- Clones in v2 were buggy
- Resources were started on 2 nodes
- Stopped again on "1"
Heartbeat v3
- No more /etc/ha.d/haresources
- No more XML
- Better integrated monitoring
- /etc/ha.d/ha.cf has crm=yes (see the minimal sketch below)
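A minimal CRM-enabled /etc/ha.d/ha.cf could look roughly like this (interface and node names are examples, not taken from the deck):

  autojoin none
  bcast bond0
  keepalive 2
  deadtime 15
  node node-a
  node node-b
  crm yes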
Pacemaker?
- Not a fork
- Only the CRM code taken out of Heartbeat
- As of Heartbeat 2.1.3
  - Support for both OpenAIS / Heartbeat
  - Different release cycles than Heartbeat
Heartbeat, OpenAIS, Corosync?
- All messaging layers
- Initially only Heartbeat
- OpenAIS
- Heartbeat became unmaintained
- OpenAIS had heisenbugs :(
- Corosync
- Heartbeat maintenance taken over by Linbit
- The CRM detects which layer is in use
[Stack diagram: Pacemaker on top of Cluster Glue, running over Heartbeat or OpenAIS]
Pacemaker Architecture
- stonithd: the Heartbeat fencing subsystem.
- lrmd: Local Resource Management Daemon. Interacts directly with resource agents (scripts).
- pengine: Policy Engine. Computes the next state of the cluster based on the current state and the configuration.
- cib: Cluster Information Base. Contains definitions of all cluster options, nodes, resources, their relationships to one another and current status. Synchronizes updates to all cluster nodes.
- crmd: Cluster Resource Management Daemon. Largely a message broker for the PEngine and LRM; it also elects a leader to coordinate the activities of the cluster.
- openais: messaging and membership layer.
- heartbeat: messaging layer, an alternative to OpenAIS.
- ccm: short for Consensus Cluster Membership; the Heartbeat membership layer.
Configuring Heartbeat with Puppet

  heartbeat::hacf { "clustername":
    hosts   => ["host-a", "host-b"],
    hb_nic  => ["bond0"],
    hostip1 => ["10.0.128.11"],
    hostip2 => ["10.0.128.12"],
    ping    => ["10.0.128.4"],
  }

  heartbeat::authkeys { "ClusterName":
    password => "ClusterName",
  }

http://github.com/jtimberman/puppet/tree/master/heartbeat/
CRM
- Cluster Resource Manager
- Keeps nodes in sync
- XML based
- cibadmin
- CLI manageable
- crm

  configure
    property $id="cib-bootstrap-options" \
      stonith-enabled="FALSE" \
      no-quorum-policy="ignore" \
      start-failure-is-fatal="FALSE"
    rsc_defaults $id="rsc_defaults-options" \
      migration-threshold="1" \
      failure-timeout="1"
    primitive d_mysql ocf:local:mysql \
      op monitor interval="30s" \
      params test_user="sure" test_passwd="illtell" test_table="test.table"
    primitive ip_db ocf:heartbeat:IPaddr2 \
      params ip="172.17.4.202" nic="bond0" \
      op monitor interval="10s"
    group svc_db d_mysql ip_db
  commit
Heartbeat Resources
- LSB
- Heartbeat resource (+ status)
- OCF (Open Cluster Framework) (+ monitor)
- Clones (don't use in HA v2)
- Multi-state resources
LSB Resource Agents
- LSB == Linux Standard Base
- LSB resource agents are standard System V-style init scripts commonly used on Linux and other UNIX-like OSes
- LSB init scripts are stored under /etc/init.d/
- This enables Linux-HA to immediately support nearly every service that comes with your system, and most packages that ship their own init script
- It's straightforward to change an LSB script into an OCF script (compare the two sketches below)
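For illustration, a stripped-down LSB-style init script for a hypothetical service ("myservice"; paths and binary are invented) could look like this; the status action and its exit codes are what the cluster relies on:

  #!/bin/sh
  # Hypothetical LSB-style init script: start/stop/status for "myservice".
  # LSB convention: status exits 0 when running, 3 when stopped.
  PIDFILE=/var/run/myservice.pid

  case "$1" in
    start)
      /usr/sbin/myservice --pidfile "$PIDFILE"
      ;;
    stop)
      [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" && rm -f "$PIDFILE"
      ;;
    status)
      if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        echo "myservice is running"
        exit 0
      else
        echo "myservice is stopped"
        exit 3
      fi
      ;;
    *)
      echo "Usage: $0 {start|stop|status}"
      exit 2
      ;;
  esac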
OCF
- OCF == Open Cluster Framework
- OCF resource agents are the most powerful type of resource agent we support
- OCF RAs are extended init scripts
  - They have additional actions:
    - monitor: for monitoring resource health
    - meta-data: for providing information about the RA
- OCF RAs are located in /usr/lib/ocf/resource.d/provider-name/
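A bare-bones OCF resource agent is essentially the same dispatch script with extra actions and well-defined exit codes. The sketch below is a hypothetical agent (the state file is invented) that hardcodes the standard OCF return codes instead of sourcing the ocf-shellfuncs helpers, just to keep it short:

  #!/bin/sh
  # Minimal OCF resource agent sketch. A real agent would source the
  # ocf-shellfuncs helpers and ship full meta-data XML.
  OCF_SUCCESS=0
  OCF_ERR_UNIMPLEMENTED=3
  OCF_NOT_RUNNING=7

  STATE=/var/run/myservice.ocf.state   # invented marker file for the example

  case "$1" in
    start)
      touch "$STATE"
      exit $OCF_SUCCESS
      ;;
    stop)
      rm -f "$STATE"
      exit $OCF_SUCCESS
      ;;
    monitor)
      # Report running / not running to the cluster
      [ -f "$STATE" ] && exit $OCF_SUCCESS || exit $OCF_NOT_RUNNING
      ;;
    meta-data)
      echo '<resource-agent name="myservice"/>'   # truncated for the slide
      exit $OCF_SUCCESS
      ;;
    *)
      exit $OCF_ERR_UNIMPLEMENTED
      ;;
  esac

Dropped into /usr/lib/ocf/resource.d/<provider>/myservice, such an agent would be referenced as ocf:<provider>:myservice.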
Monitoring
- Defined in the OCF resource script
- Configured in the parameters
- You have to support multiple states:
  - Not running
  - Running
  - Failed
Anatomy of a Cluster Config
- Cluster properties
- Resource defaults
- Primitive definitions
- Resource groups and constraints
Cluster Properties

  property $id="cib-bootstrap-options" \
    stonith-enabled="FALSE" \
    no-quorum-policy="ignore" \
    start-failure-is-fatal="FALSE"

no-quorum-policy = ignore: we'll ignore the loss of quorum on a 2-node cluster.
start-failure-is-fatal: when set to FALSE, the cluster will instead use the resource's failcount and the value of resource-failure-stickiness.
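This example disables STONITH to keep the demo simple; in a production cluster you would leave stonith-enabled at "TRUE" and define a fencing resource. A sketch of what that could look like (device type, addresses and parameter names are illustrative; check the chosen agent's metadata for the real parameters):

  primitive st_node_a stonith:external/ipmi \
    params hostname="node-a" ipaddr="10.0.128.21" userid="admin" passwd="secret" \
    op monitor interval="60s"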
Resource Defaults

  rsc_defaults $id="rsc_defaults-options" \
    migration-threshold="1" \
    failure-timeout="1" \
    resource-stickiness="INFINITY"

failure-timeout means that after a failure there will be a 60-second timeout before the resource can come back to the node on which it failed.
migration-threshold=1 means that after 1 failure the resource will try to start on the other node.
resource-stickiness=INFINITY means that the resource really wants to stay where it is now.
Primitive Definitions

  primitive d_mine ocf:custom:tomcat \
    params instance_name="mine" monitor_urls="health.html" monitor_use_ssl="no" \
    op monitor interval="15s" on-fail="restart"

  primitive ip_mine_svc ocf:heartbeat:IPaddr2 \
    params ip="10.8.4.131" cidr_netmask="16" nic="bond0" \
    op monitor interval="10s"
Parsing a Config
- Isn't always done correctly
- Even a verify won't find all issues
- Unexpected behaviour might occur
Where a Resource Runs
- Multi-state resources
  - Master / Slave
    - e.g. MySQL master-slave, DRBD
- Clones
  - Resources that can run on multiple nodes
    - e.g. multi-master MySQL servers, MySQL slaves, stateless applications
- location
  - Preferred location to run a resource, e.g. based on hostname
- colocation
  - Resources that have to live together
    - e.g. IP address + service
- order
  - Defines which resource has to start first, or wait for another resource
- groups
  - Colocation + order
e.g. A Service on DRBD
- DRBD can only be active on 1 node
- The filesystem needs to be mounted on that active DRBD node

  group svc_mine d_mine ip_mine
  ms ms_drbd_storage drbd_storage \
    meta master_max="1" master_node_max="1" clone_max="2" clone_node_max="1" notify="true"
  colocation fs_on_drbd inf: svc_mine ms_drbd_storage:Master
  order fs_after_drbd inf: ms_drbd_storage:promote svc_mine:start
  location cli-prefer-svc_db svc_db \
    rule $id="cli-prefer-rule-svc_db" inf: #uname eq db-a
crm Commands
  crm                    Start the cluster resource manager shell
  crm resource           Change into resource mode
  crm configure          Change into configure mode
  crm configure show     Show the current resource configuration
  crm resource show      Show the current resource state
  cibadmin -Q            Dump the full Cluster Information Base in XML
Using crm
- crm configure
- edit primitive
- verify
- commit
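Put together, an edit cycle in the crm shell looks roughly like this (the resource name follows the earlier examples):

  crm configure
  crm(live)configure# edit d_mysql     # opens the primitive definition in $EDITOR
  crm(live)configure# verify           # sanity-check the pending configuration
  crm(live)configure# commit           # push the change to the live CIB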
But We Love XML
- cibadmin -Q
Checking the Cluster State

  crm_mon -1
  ============
  Last updated: Wed Nov 4 16:44:26 2009
  Stack: Heartbeat
  Current DC: xms-1 (c2c581f8-4edc-1de0-a959-91d246ac80f5) - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  2 Nodes configured, unknown expected votes
  2 Resources configured.
  ============
  Online: [ xms-1 xms-2 ]
  Resource Group: svc_mysql
      d_mysql        (ocf::ntc:mysql):         Started xms-1
      ip_mysql       (ocf::heartbeat:IPaddr2): Started xms-1
  Resource Group: svc_XMS
      d_XMS          (ocf::ntc:XMS):           Started xms-2
      ip_XMS         (ocf::heartbeat:IPaddr2): Started xms-2
      ip_XMS_public  (ocf::heartbeat:IPaddr2): Started xms-2
Stopping a Resource

  crm resource stop svc_XMS
  crm_mon -1
  ============
  Last updated: Wed Nov 4 16:56:05 2009
  Stack: Heartbeat
  Current DC: xms-1 (c2c581f8-4edc-1de0-a959-91d246ac80f5) - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  2 Nodes configured, unknown expected votes
  2 Resources configured.
  ============
  Online: [ xms-1 xms-2 ]
  Resource Group: svc_mysql
      d_mysql        (ocf::ntc:mysql):         Started xms-1
      ip_mysql       (ocf::heartbeat:IPaddr2): Started xms-1
Starting a Resource

  crm resource start svc_XMS
  crm_mon -1
  ============
  Last updated: Wed Nov 4 17:04:56 2009
  Stack: Heartbeat
  Current DC: xms-1 (c2c581f8-4edc-1de0-a959-91d246ac80f5) - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  2 Nodes configured, unknown expected votes
  2 Resources configured.
  ============
  Online: [ xms-1 xms-2 ]
  Resource Group: svc_mysql
      d_mysql        (ocf::ntc:mysql):         Started xms-1
      ip_mysql       (ocf::heartbeat:IPaddr2): Started xms-1
  Resource Group: svc_XMS
Moving a Resource
- resource migrate
- Is permanent, even upon failure
- Useful in upgrade scenarios
- Use resource unmigrate to restore
Moving a Resource

  [xpoll-root@XMS-1 ~]# crm resource migrate svc_XMS xms-1
  [xpoll-root@XMS-1 ~]# crm_mon -1
  Last updated: Wed Nov 4 17:32:50 2009
  Stack: Heartbeat
  Current DC: xms-1 (c2c581f8-4edc-1de0-a959-91d246ac80f5) - partition with quorum
  Version: 1.0.5-462f1569a43740667daf7b0f6b521742e9eb8fa7
  2 Nodes configured, unknown expected votes
  2 Resources configured.
  Online: [ xms-1 xms-2 ]
  Resource Group: svc_mysql
      d_mysql        (ocf::ntc:mysql):         Started xms-1
      ip_mysql       (ocf::heartbeat:IPaddr2): Started xms-1
  Resource Group: svc_XMS
      d_XMS          (ocf::ntc:XMS):           Started xms-1
      ip_XMS         (ocf::heartbeat:IPaddr2): Started xms-1
      ip_XMS_public  (ocf::heartbeat:IPaddr2): Started xms-1
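Because the migration constraint is permanent, clear it once the move is no longer needed:

  crm resource unmigrate svc_XMS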
Migrate vs Standby
- Think of clusters with more than 2 nodes
- Migrate: send the resource to node X
  - Only use that node
- Standby: do not send resources to node X
  - But use the other available nodes
Debugging
- Check crm_mon -f
- Failcounts?
- Did the application launch correctly?
- /var/log/messages
  - Warning: very verbose
Resource Not Running

  [menos-val3-root@mrs-a ~]# crm
  crm(live)# resource
  crm(live)resource# show
  Resource Group: svc-MRS
      d_MRS       (ocf::ntc:tomcat)         Stopped
      ip_MRS_svc  (ocf::heartbeat:IPaddr2)  Stopped
      ip_MRS_usr  (ocf::heartbeat:IPaddr2)  Stopped
Resource Failcount

  [menos-val3-root@mrs-a ~]# crm
  crm(live)# resource
  crm(live)resource# failcount d_MRS show mrs-a
  scope=status name=fail-count-d_MRS value=1
  crm(live)resource# failcount d_MRS delete mrs-a
  crm(live)resource# failcount d_MRS show mrs-a
  scope=status name=fail-count-d_MRS value=0
Pacemaker and Puppet
- Plenty of non-usable modules around
  - HA v1
- https://github.com/rodjek/puppet-pacemaker.git
  - Strict set of ops / parameters
- Make sure your modules don't enable resources
- I've been using templates to populate the config
- cibadmin to configure (see the example below)
- crm is complex; even crm itself doesn't parse it correctly yet
- Plenty of work ahead!
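For example, a CIB fragment rendered from a template can be pushed with cibadmin (the file path is invented):

  cibadmin --replace --xml-file /etc/puppet-rendered/cib.xml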
Getting Help
- http://clusterlabs.org
- #linux-ha on irc.freenode.net
- http://www.drbd.org/users-guide/
Contact
Kris Buytaert
[email_address]
@krisbuytaert

Further Reading
http://www.krisbuytaert.be/blog/
http://www.inuits.be/
http://www.virtualization.com/
http://www.oreillygmt.com/

Esquimaux, Kheops Business Center, Avenue Georges Lemaître 54, 6041 Gosselies, 889.780.406, +32 495 698 668
Inuits, 't Hemeltje, Gemeentepark 2, 2930 Brasschaat, 891.514.231, +32 473 441 636