
KVM High Availability Regardless of Storage - Gabriel Brascher, VP of Apache CloudStack - CloudStack European User Group Virtual, May 2021

Enabling High Availability for KVM hosts can greatly improve QoS by handling (fencing or recovering) a problematic host and restarting its stopped VMs on healthy hosts. However, CloudStack HA for KVM has a limitation: it relies mainly on NFS heartbeat script checks. This talk illustrates how CloudStack HA works for KVM hosts and presents a way to improve the implementation so that KVM HA works with any storage system pluggable on KVM, not just NFS.

About Gabriel Bräscher - https://blogs.apache.org/cloudstack/
------------------------------------------

CloudStack European User Group Virtual took place on May 27th. The first CSEUG Virtual proved to be a huge success. It brought together people from 23 countries: Germany, the United Kingdom, Switzerland, India, Bulgaria, Greece, Poland, Serbia, Brazil, Chile, Russia, USA, Canada, Japan, France, Uruguay, Korea …
We also had a record number of registrations and attendees for a CloudStack User Group event. Physical distance did not stop our speakers, who joined the event from 6 different countries.
------------------------------------------
About CloudStack: https://cloudstack.apache.org/


  1. KVM High Availability regardless of storage. CloudStack™ European User Group Virtual - May 27th 2021
  2. Who am I? gabriel@apache.org • Gabriel Beims Bräscher, Brazilian • Software Developer at PCextreme B.V. ○ Dutch hosting company founded in 2004 • 2013: First time using CloudStack (CloudStack 4.1.0) • 2017: Apache CloudStack Committer • 2019: CloudStack Project Management Committee (PMC) • 2021: Appointed by the ASF as PMC Chair (VP) of CloudStack
  3. Summary: What does this presentation bring? • CloudStack KVM HA • Health Check with NFS • Can we have KVM HA without NFS? • KVM HA regardless of storage • Take away: future
  4. CloudStack KVM HA: Why configure HA for Hosts? Why? • Improve QoS ○ VMs should run as much as possible ○ Hosts should not stay “Down”
  5. CloudStack KVM HA: How does it work? Why? • Improve QoS ○ VMs should run as much as possible ○ Hosts should not stay “Down” How? • Detect problematic Host • Re-start its stopped VMs
  6. CloudStack KVM HA: How does it work? Why? • Improve QoS ○ VMs should run as much as possible ○ Hosts should not stay “Down” How? • Detect problematic Host • Recover or Fence it • Re-start its stopped VMs. We don’t want 2 VMs mapped to the same storage path • CloudStack cannot reach a Host • VMs are still running and writing/reading on storage
  7. CloudStack KVM HA: HA States. Host HA States • Disabled: HA Operations disabled • Available: The resource is healthy • Ineligible: The current state does not support HA/recovery • Suspect: Most recent health check failed • Degraded: The resource cannot be managed, but services end user requests • Checking: The activity checks are currently being performed • Recovering: The resource is undergoing recovery operation • Recovered: The resource is recovered • Fencing: The resource is undergoing fence operation • Fenced: The resource is fenced. Link: https://github.com/apache/cloudstack/blob/master/api/src/main/java/org/apache/cloudstack/ha/HAConfig.java
  8. CloudStack KVM HA: HA States. Link: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
  9. CloudStack KVM HA: Requirements. Out-of-band management • IPMI • Redfish (CloudStack 4.15.0+) Enable HA • VM Service offerings enabled for HA • Hosts enabled for HA. Use NFS as shared primary storage pool
  10. Health Check with NFS: Why use NFS? Why NFS? • Hosts in the same cluster can check the same storage • Check the storage activity How does it work? • A heartbeat script running on the KVM nodes checks whether it can write/read on the mounted NFS partition
  11. Health Check with NFS: Today, with NFS
  12. Health Check with NFS: Why use NFS? "Currently KVM HA works by monitoring an NFS based heartbeat file and it can often fail whenever this network share becomes slower, causing the hypervisors to reboot. This can be particularly annoying when you have different kinds of primary storages in place which are working fine (people running CEPH etc). ... This is embarrassing. How can we fix it? Ideas, suggestions? How are other hypervisors doing it?" (Nux, 09 October 2015, JIRA Issue: CLOUDSTACK-8943) Link: https://issues.apache.org/jira/browse/CLOUDSTACK-8943
  13. Can we have KVM HA without NFS? What are the possible validations? Possible validations • Request to the CloudStack Agent (JVM) -- Java can crash • Check storage activity -- cost to implement & maintain (for each storage) • Check via Libvirt • Ping host -- Ping is limited and firewalls can often block it
  14. KVM HA regardless of storage: CloudStack + KVM + HA - NFS. Possible validations • Request to the CloudStack Agent (JVM) -- Java can crash • Check storage activity -- cost to implement & maintain (for each storage) • Check via Libvirt • Ping host -- Ping is limited and firewalls can often block it
  15. KVM HA regardless of storage: Today, with NFS
  16. KVM HA regardless of storage: Proposal with KVM HA Agent Helper web-service
  17. KVM HA regardless of storage: HTTP Request for checking neighbour hosts
  18. KVM HA regardless of storage: What if the NFS check fails?
  19. KVM HA regardless of storage: What if the KVM HA Helper fails?
  20. KVM HA regardless of storage: What if both fail?
  21. KVM HA regardless of storage: In a nutshell. KVM HA Agent: HTTP REST API that checks Libvirt • The web-service runs Libvirt commands to list VMs ( ~$ virsh list ) • Checks neighbour hosts via the same agent • One can enable or disable the KVM HA Agent checks • If NFS is used on the cluster, it is also taken into account • If no NFS is used, Heart Beat checks are skipped. Example: • HTTP GET -> http://host.name:8080/ ○ response: {"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]} • HTTP GET -> http://host.name:8080/check-neighbour/neighbour.name:8080 ○ response: {"status": "Up"} OR {"status": "Down"}
  22. KVM HA regardless of storage: Possible outcomes. All Good • HTTP Request gets a response listing VMs that match the DB. Warning • HTTP Request gets a response but the listed VMs do not match the DB. Recover/Fence • HTTP Request gets a response listing zero VMs but according to the DB there are VMs running • HTTP Request gets an error code (e.g. 404); the service is not reachable
  23. Take away: Future • HA systems are critical and will always need attention • HA can be done regardless of storage • However, combining multiple checks can lead to robust systems • Code is already available at PR #4978 • Running on a test environment • Implementation aimed at 4.16.0.0 or the next LTS. Link for PR: https://github.com/apache/cloudstack/pull/4978
  24. Thanks! Questions? #CSEUGvirtual #cloudstack #cloustackworks contact: gabriel@apache.org
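
Slide 10 describes the NFS health check as a heartbeat script that verifies it can write to and read from the mounted NFS partition. The idea can be sketched roughly as below. This is not CloudStack's actual heartbeat script: the file naming (`hb-<host>`), the function names, and the freshness threshold are all assumptions made for illustration.

```python
import os
import tempfile
import time


def write_heartbeat(mount_dir, host_name, now=None):
    """Write a timestamp to a per-host heartbeat file (hypothetical naming)."""
    now = time.time() if now is None else now
    path = os.path.join(mount_dir, f"hb-{host_name}")
    with open(path, "w") as f:
        f.write(str(int(now)))
        f.flush()
        os.fsync(f.fileno())  # push the write through to the storage
    return path


def heartbeat_is_fresh(mount_dir, host_name, max_age_seconds, now=None):
    """Return True if the host's heartbeat can be read and is recent enough."""
    now = time.time() if now is None else now
    path = os.path.join(mount_dir, f"hb-{host_name}")
    try:
        with open(path) as f:
            last_beat = int(f.read().strip())
    except (OSError, ValueError):
        # Missing, unreadable, or corrupt heartbeat counts as a failed check.
        return False
    return (now - last_beat) <= max_age_seconds


# Demo on a temporary directory standing in for the NFS mount point.
with tempfile.TemporaryDirectory() as mount:
    write_heartbeat(mount, "host-a", now=1000)
    fresh = heartbeat_is_fresh(mount, "host-a", max_age_seconds=60, now=1030)
    fresh_later = heartbeat_is_fresh(mount, "host-a", max_age_seconds=60, now=2000)
    print(fresh, fresh_later)  # True False
```

Because every host in the cluster mounts the same share, a neighbour can run the freshness check on another host's file. This is also why a slow share can make a healthy host look dead, as the JIRA issue on slide 12 points out.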

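Slide 21 describes the KVM HA Agent as a small HTTP service that lists VMs through Libvirt and answers in JSON. The sketch below is a minimal stand-in, not the code from PR #4978. It uses only Python's standard library and assumes `virsh list --name` is available on the host. The names `list_running_vms`, `make_handler`, `serve` and `fetch_vm_list` are invented for illustration, and the `/check-neighbour/` endpoint is left out.

```python
import http.client
import json
import subprocess
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def list_running_vms():
    """Return the names of running VMs as printed by `virsh list --name`."""
    out = subprocess.run(
        ["virsh", "list", "--name"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def make_handler(list_vms=list_running_vms):
    """Build a request handler; `list_vms` is injectable so it can be faked."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/":
                self.send_error(404)
                return
            try:
                vms = list_vms()
            except (OSError, subprocess.CalledProcessError):
                self.send_error(500)
                return
            payload = json.dumps(
                {"count": len(vms), "virtualmachines": vms}
            ).encode()
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return Handler


def serve(port=8080, list_vms=list_running_vms):
    """Run the helper until interrupted (port 8080 as in the slide example)."""
    HTTPServer(("", port), make_handler(list_vms)).serve_forever()


def fetch_vm_list(host, port, timeout=5.0):
    """Client side: GET / on a helper and return (status, body text)."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        resp = conn.getresponse()
        return resp.status, resp.read().decode()
    finally:
        conn.close()
```

Calling `serve()` on a KVM host would expose the list on port 8080, and `fetch_vm_list("host.name", 8080)` would retrieve it from another machine. Passing a fake `list_vms` to `make_handler` lets the service be exercised without Libvirt installed.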
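
The possible outcomes on slide 22 amount to a small decision rule: compare what the helper reports with what the database expects. A sketch of that rule follows, using the response shape from slide 21. The `Outcome` enum and `classify` function are illustrative names, and treating a malformed response body the same as an error is an assumption the slides do not cover.

```python
import json
from enum import Enum


class Outcome(Enum):
    ALL_GOOD = "all good"
    WARNING = "warning"
    RECOVER_OR_FENCE = "recover/fence"


def classify(http_status, body, db_vms):
    """Map a helper response and the DB's VM list to one of the outcomes.

    http_status is None when the service could not be reached at all.
    """
    if http_status != 200:
        # Error code (e.g. 404) or unreachable service.
        return Outcome.RECOVER_OR_FENCE
    try:
        listed = set(json.loads(body)["virtualmachines"])
    except (ValueError, KeyError, TypeError):
        # Assumption: an unparsable answer is handled like an error.
        return Outcome.RECOVER_OR_FENCE
    expected = set(db_vms)
    if not listed and expected:
        # Host answers but reports zero VMs while the DB says some are running.
        return Outcome.RECOVER_OR_FENCE
    if listed == expected:
        return Outcome.ALL_GOOD
    return Outcome.WARNING


db = ["r-123-VM", "v-134-VM", "s-111-VM"]
ok_body = '{"count": 3, "virtualmachines": ["r-123-VM", "v-134-VM", "s-111-VM"]}'
print(classify(200, ok_body, db).value)  # all good
```

Comparing sets rather than counts means a host that runs the right number of VMs but the wrong ones still produces a warning.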