
MySQL HA with Pacemaker

My opendbcamp 2011 presentation on Pacemaker and MySQL opportunities



  1. MySQL HA with Pacemaker (Kris Buytaert, #opendbcamp)
  2. Kris Buytaert
     ● I used to be a Dev, then became an Op
     ● Today I feel like a Dev again
     ● Senior Linux and Open Source Consultant
     ● "Infrastructure Architect"
     ● Building clouds since before the Cloud
     ● Surviving the 10th floor test
     ● Co-author of some books
     ● Guest editor at some sites
  3. In this presentation
     ● High Availability?
     ● MySQL HA solutions
     ● Linux-HA / Pacemaker
  4. What is HA clustering?
     ● One service goes down => others take over its work
     ● IP address takeover, service takeover
     ● Not designed for high performance
     ● Not designed for high throughput (load balancing)
  5. Lies, Damn Lies, and Statistics: counting nines (slide by Alan R.)
     99.9999%   30 sec downtime/year
     99.999%     5 min downtime/year
     99.99%     52 min downtime/year
     99.9%       9 hr  downtime/year
     99%       3.5 day downtime/year
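The nines table is just arithmetic on the fraction of a year a service may be down. A quick sanity check (a sketch, not from the slides):

```python
# Allowed downtime per year for a given availability percentage.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def downtime_seconds(availability_pct):
    """Seconds of allowed downtime per year at the given availability."""
    return (1 - availability_pct / 100.0) * SECONDS_PER_YEAR


for pct in (99.9999, 99.999, 99.99, 99.9, 99.0):
    print(f"{pct}% -> {downtime_seconds(pct):.0f} s/year")
```

Running this reproduces the table: five nines allows roughly five minutes of downtime per year, three nines almost nine hours.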
  6. The Rules of HA
     ● Keep it Simple
     ● Keep it Simple
     ● Prepare for Failure
     ● Complexity is the enemy of reliability
     ● Test your HA setup
  7. Eliminating the SPOF
     ● Find out what will fail
       • Disks
       • Fans
       • Power (supplies)
     ● Find out what can fail
       • Network
       • Running out of memory
  8. Data vs Connection
     ● Data:
       • Replication
       • Shared storage
       • DRBD
     ● Connection:
       • LVS
       • Proxy
       • Heartbeat / Pacemaker
  9. Shared Storage
     ● 1 MySQL instance
     ● Monitor the MySQL node
     ● STONITH
     ● $$$: 1+1 <> 2
     ● Storage = SPOF
     ● Split brain :(
  10. DRBD
      ● Distributed Replicated Block Device
      ● In the Linux kernel
      ● Usually only 1 mount
        • Multi-mount as of 8.x
        • Requires GFS / OCFS2
      ● Regular FS: ext3, ...
      ● Only 1 active MySQL instance accessing the data
      ● Upon failover MySQL needs to be started on the other node
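A DRBD resource of the kind the slide describes is defined in /etc/drbd.conf. A minimal sketch (hostnames, addresses and block devices here are invented placeholders):

```
resource r0 {
  protocol C;               # fully synchronous replication
  device    /dev/drbd0;     # the replicated device MySQL's FS is mounted from
  disk      /dev/sdb1;      # backing disk on each node
  meta-disk internal;
  on node-a { address 10.0.0.1:7788; }
  on node-b { address 10.0.0.2:7788; }
}
```

With protocol C a write is acknowledged only once both nodes have it, which is what makes a failover of the single active MySQL instance safe.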
  11. DRBD (2)
      ● What happens when you pull the plug on a physical machine?
        • Minimal timeout
        • Why did the crash happen?
        • Is my data still correct?
        • InnoDB consistency checks?
          • Lengthy?
        • Check your BinLog size
  12. Other solutions today
      ● MySQL Cluster (NDBD)
      ● Multi-master replication
      ● MySQL Proxy
      ● MMM
      ● Flipper
      ● BYO
      ● ...
  13. Pulling traffic
      ● E.g. for Cluster / multi-master setups:
        • DNS
        • Advanced routing
        • LVS
        • Or the upcoming slides
  14. Linux-HA Pacemaker
      ● Plays well with others
      ● Manages more than MySQL
      ● ...v3: don't even think about the rest anymore
  15. Heartbeat v1
      • Max 2 nodes
      • No fine-grained resources
      • Monitoring using "mon"
      • Configured via /etc/ha.d/haresources (e.g. IPaddr2:: entries, ntc-restart-mysql, mon) and /etc/ha.d/authkeys
  16. Heartbeat v2
      • Stability issues
      • Forking? "A consulting opportunity" (LMB)
  17. Clone resources
      Clones in v2 were buggy: resources were started on 2 nodes, then stopped again on "1".
  18. Heartbeat v3
      • No more /etc/ha.d/haresources
      • No more XML
      • Better integrated monitoring
      • /etc/ha.d/ha.cf has crm=yes
  19. Pacemaker?
      ● Not a fork
      ● Only the CRM code, taken out of Heartbeat
      ● As of Heartbeat 2.1.3:
        • Support for both OpenAIS and Heartbeat
        • Different release cycle than Heartbeat
  20. Heartbeat, OpenAIS, Corosync?
      ● All messaging layers
      ● Initially only Heartbeat
      ● Then OpenAIS
      ● Heartbeat went unmaintained
      ● OpenAIS had heisenbugs :(
      ● Then Corosync
      ● Heartbeat maintenance taken over by LinBit
      ● The CRM detects which layer is in use
  21. The stack (diagram): Pacemaker on top of Heartbeat or OpenAIS, on top of Cluster Glue
  22. Pacemaker architecture
      ● stonithd: the Heartbeat fencing subsystem.
      ● lrmd: Local Resource Management Daemon. Interacts directly with resource agents (scripts).
      ● pengine: Policy Engine. Computes the next state of the cluster based on the current state and the configuration.
      ● cib: Cluster Information Base. Contains definitions of all cluster options, nodes, resources, their relationships to one another and current status. Synchronizes updates to all cluster nodes.
      ● crmd: Cluster Resource Management Daemon. Largely a message broker for the PEngine and LRM, it also elects a leader to co-ordinate the activities of the cluster.
      ● openais: messaging and membership layer.
      ● heartbeat: messaging layer, an alternative to OpenAIS.
      ● ccm: short for Consensus Cluster Membership. The Heartbeat membership layer.
  23. Configuring Heartbeat correctly (with Puppet)
      heartbeat::hacf { "clustername":
        hosts   => ["host-a", "host-b"],
        hb_nic  => ["bond0"],
        hostip1 => [""],
        hostip2 => [""],
        ping    => [""],
      }
      heartbeat::authkeys { "ClusterName":
        password => "ClusterName",
      }
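A manifest like the one above ultimately renders plain Heartbeat configuration files. A minimal sketch of what they might look like (node names, NIC and ping address are placeholders; the slide leaves the real IPs out):

```
# /etc/ha.d/ha.cf
autojoin none
bcast bond0
node host-a host-b
ping 10.0.0.254
crm yes

# /etc/ha.d/authkeys  (must be mode 0600)
auth 1
1 sha1 SomeSharedSecret
```

Managing these files through configuration management, as the slide suggests, keeps both nodes' copies identical, which Heartbeat requires.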
  24. CRM: Cluster Resource Manager
      ● Keeps nodes in sync
      ● XML based
      ● CLI manageable: crm, cibadmin
      configure
        property $id="cib-bootstrap-options" \
          stonith-enabled="FALSE" \
          no-quorum-policy="ignore" \
          start-failure-is-fatal="FALSE"
        rsc_defaults $id="rsc_defaults-options" \
          migration-threshold="1" \
          failure-timeout="1"
        primitive d_mysql ocf:local:mysql \
          op monitor interval="30s" \
          params test_user="sure" test_passwd="illtell" test_table="test.table"
        primitive ip_db ocf:heartbeat:IPaddr2 \
          params ip="" nic="bond0" \
          op monitor interval="10s"
        group svc_db d_mysql ip_db
        commit
  25. Heartbeat resources
      ● LSB
      ● Heartbeat resource (+ status)
      ● OCF (Open Cluster Framework) (+ monitor)
      ● Clones (don't use in HA v2)
      ● Multi-state resources
  26. LSB resource agents
      ● LSB == Linux Standard Base
      ● LSB resource agents are standard System V-style init scripts commonly used on Linux and other UNIX-like OSes
      ● LSB init scripts are stored under /etc/init.d/
      ● This enables Linux-HA to immediately support nearly every service that comes with your system, and most packages which come with their own init script
      ● It's straightforward to change an LSB script into an OCF script
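In the crm shell, an LSB init script is addressed with the lsb: resource class; the only health check available is the script's status action. A sketch (the script name mysql is an assumption about your distribution's packaging):

```
primitive p_mysql lsb:mysql \
    op monitor interval="30s"
```

Because LSB agents expose nothing beyond start/stop/status, anything needing a real health check belongs in an OCF agent instead, as the next slide explains.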
  27. OCF
      ● OCF == Open Cluster Framework
      ● OCF resource agents are the most powerful type of resource agent we support
      ● OCF RAs are extended init scripts; they have additional actions:
        • monitor – for monitoring resource health
        • meta-data – for providing information about the RA
      ● OCF RAs are located in /usr/lib/ocf/resource.d/provider-name/
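An OCF agent is just an executable honouring a small action contract and a set of exit codes. A minimal sketch in Python (real agents are usually shell scripts; the state-file "service" and agent name here are invented for illustration):

```python
import os
import sys
import tempfile

# OCF exit codes (subset)
OCF_SUCCESS = 0
OCF_ERR_UNIMPLEMENTED = 3
OCF_NOT_RUNNING = 7

# Stand-in for a real service: a flag file marks "running".
STATE_FILE = os.path.join(tempfile.gettempdir(), "demo-ra.state")


def handle(action):
    """Dispatch one OCF action and return its exit code."""
    if action == "start":
        open(STATE_FILE, "w").close()
        return OCF_SUCCESS
    if action == "stop":
        if os.path.exists(STATE_FILE):
            os.remove(STATE_FILE)
        return OCF_SUCCESS
    if action == "monitor":
        # The extra action LSB scripts lack: report real health.
        return OCF_SUCCESS if os.path.exists(STATE_FILE) else OCF_NOT_RUNNING
    if action == "meta-data":
        print("<resource-agent name='demo'/>")  # real agents emit full XML
        return OCF_SUCCESS
    return OCF_ERR_UNIMPLEMENTED


if __name__ == "__main__":
    sys.exit(handle(sys.argv[1] if len(sys.argv) > 1 else "monitor"))
```

The crucial detail is monitor returning OCF_NOT_RUNNING (7) rather than a generic error: that is how Pacemaker distinguishes "cleanly stopped" from "failed".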
  28. Monitoring
      ● Defined in the OCF resource script
      ● Configured in the parameters
      ● You have to support multiple states:
        • Not running
        • Running
        • Failed
  29. Anatomy of a cluster config
      • Cluster properties
      • Resource defaults
      • Primitive definitions
      • Resource groups and constraints
  30. Cluster properties
      property $id="cib-bootstrap-options" \
        stonith-enabled="FALSE" \
        no-quorum-policy="ignore" \
        start-failure-is-fatal="FALSE"
      no-quorum-policy: we'll ignore the loss of quorum on a 2-node cluster.
      start-failure-is-fatal: when set to FALSE, the cluster will instead use the resource's failcount and the value of resource-failure-stickiness.
  31. Resource defaults
      rsc_defaults $id="rsc_defaults-options" \
        migration-threshold="1" \
        failure-timeout="1" \
        resource-stickiness="INFINITY"
      failure-timeout: after a failure there will be a 60-second timeout before the resource can come back to the node on which it failed.
      migration-threshold="1": after 1 failure the resource will try to start on the other node.
      resource-stickiness="INFINITY": the resource really wants to stay where it is now.
  32. Primitive definitions
      primitive d_mine ocf:custom:tomcat \
        params instance_name="mine" monitor_urls="health.html" monitor_use_ssl="no" \
        op monitor interval="15s" on-fail="restart"
      primitive ip_mine_svc ocf:heartbeat:IPaddr2 \
        params ip="" cidr_netmask="16" nic="bond0" \
        op monitor interval="10s"
  33. Parsing a config
      ● Isn't always done correctly
      ● Even a verify won't find all issues
      ● Unexpected behaviour might occur
  34. Where a resource runs
      • Multi-state resources
        • Master–Slave, e.g. MySQL master–slave, DRBD
      • Clones
        • Resources that can run on multiple nodes, e.g.:
          • Multi-master MySQL servers
          • MySQL slaves
          • Stateless applications
      • location: preferred location to run a resource, e.g. based on hostname
      • colocation: resources that have to live together, e.g. IP address + service
      • order: define which resource has to start first, or wait for another resource
      • groups: colocation + order
  35. E.g. a service on DRBD
      ● DRBD can only be active on 1 node
      ● The filesystem needs to be mounted on that active DRBD node
      group svc_mine d_mine ip_mine
      ms ms_drbd_storage drbd_storage \
        meta master_max="1" master_node_max="1" clone_max="2" clone_node_max="1" notify="true"
      colocation fs_on_drbd inf: svc_mine ms_drbd_storage:Master
      order fs_after_drbd inf: ms_drbd_storage:promote svc_mine:start
      location cli-prefer-svc_db svc_db \
        rule $id="cli-prefer-rule-svc_db" inf: #uname eq db-a
  36. A MySQL resource
      ● OCF
        • Clone
          • Where do you hook up the IP?
        • Multi-state
          • But we have master–master replication
        • Meta resource
          • Dummy resource that can monitor:
            • Connection
            • Replication state
  37. Simple 2-node example
      primitive d_mysql ocf:ntc:mysql \
        op monitor interval="30s" \
        params test_user="just" test_passwd="kidding" test_table="really"
      primitive ip_mysql_svc ocf:heartbeat:IPaddr2 \
        params ip="" cidr_netmask="" nic="bond0" \
        op monitor interval="10s"
      group svc_mysql d_mysql ip_mysql_svc
  38. Monitor your setup
      ● Not just connectivity
      ● Also functional:
        • Query data
        • Check the result set is correct
      ● Check replication:
        • MaatKit
        • OpenARK
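The "not just connectivity" point can be captured in a few lines: a functional check runs a known query and verifies the answer, so stale or corrupted data fails the check even when connections succeed. A sketch (the query runner is injected; the table name and rows are invented):

```python
def functional_check(run_query, expected_rows):
    """Functional health check: run a known query and verify the result
    set, rather than only testing that a connection can be opened."""
    try:
        rows = run_query("SELECT id, value FROM test.table ORDER BY id")
    except Exception:
        return False               # connectivity or query failure
    return rows == expected_rows   # wrong data is also a failure


# Stub standing in for a real MySQL client call:
def fake_query(sql):
    return [(1, "a"), (2, "b")]


print(functional_check(fake_query, [(1, "a"), (2, "b")]))  # True
```

Wired into an OCF agent's monitor action, this makes Pacemaker fail a node over when the database answers but answers wrongly.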
  39. How to deal with replication state?
      ● Multiple slaves:
        • Use the DRBD OCF resource
      ● 2 masters: use our own script
        • Replication is slow on the active node:
          • Shouldn't happen; talk to HR / config-management people
        • Replication is slow on the passive node:
          • Weight--
        • Replication breaks on the active node:
          • Send out a warning, don't modify weights, and check the other node
        • Replication breaks on the passive node:
          • Fence off the passive node
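The two-master policy above is really a small decision table. A sketch of how such a script might decide (the action strings are mine, not from the slides):

```python
def replication_action(node_is_active, replication_slow, replication_broken):
    """Decide what to do per node, following the slide's policy for a
    two-master setup. Broken replication outranks slow replication."""
    if replication_broken:
        if node_is_active:
            return "warn, keep weights, check other node"
        return "fence off the passive node"
    if replication_slow:
        if node_is_active:
            return "should not happen: talk to HR / config-management people"
        return "decrease node weight"
    return "ok"
```

Keeping the policy in one pure function like this makes it trivial to test every branch before trusting it with fencing decisions.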
  40. Adding MySQL to the stack (diagram): a service IP and "MySQLd" resources managed by the Pacemaker / Heartbeat cluster stack on Node A and Node B, with MySQL replication between the two MySQLd instances, all on top of the hardware.
  41. Pitfalls & solutions
      ● Monitor:
        • Replication state
        • Replication lag
      ● MaatKit
      ● OpenARK
  42. Conclusion
      ● Plenty of alternatives
      ● Think about your data
      ● Think about getting queries to that data
      ● Complexity is the enemy of reliability
      ● Keep it simple
      ● Monitor inside the DB
  43. Contact
      Kris Buytaert, Kris.Buytaert@inuits.be, @KrisBuytaert
      Inuits, 't Hemeltje, Gemeentepark 2, 2930 Brasschaat, +32 473 441 636, 891.514.231
      Esquimaux, Kheops Business Center, Avenue Georges Lemaître 54, 6041 Gosselies, +32 495 698 668, 889.780.406
      Further Reading