The Cloud Specialists
When the Cloud is a Rockin': High Availability in Apache CloudStack
shapeblue.com • @ShapeBlue
John Burwell • @john_burwell
VP of Software Engineering
About Me
• VP of Software Engineering @ ShapeBlue
• Member, Apache CloudStack PMC (June 2013)
• Ran operations and designed automated provisioning for analytic/virtualization clouds
• Led architectural design and server-side development of a SaaS physical security platform
There’s No “I” in Team
• Rohit Yadav
• Abhi Prateek
• Murali Reddy
• Boris Stoyanov
Motivation
Currently [sic] KVM HA works by monitoring an NFS
based heartbeat file and it can often fail whenever
this network share becomes slower, causing the
hypervisors to reboot. … This is embarrassing. How
can we fix it? Ideas, suggestions? How are other
hypervisors doing it?
- Nux
15 October 2015
CLOUDSTACK-8943
Limitations/Issues
• Limited to hosts and VMs using NFS storage
• Tight coupling between the Agent and HighAvailabilityManager
• False positives which interrupt the operation of healthy resources
Inconsistent behavior prevents operators from trusting KVM HA
Build vs. Buy
Pros
• Integration with the CloudStack control plane and abstractions
• Simpler configuration
• Integrated instrumentation and logging
Cons
• Complex mechanism to implement, test, and maintain
• Forgoing a proven, battle-tested implementation
• Less functionality initially
A robust infrastructure control plane must include the ability to recover and fence resources
HA Resource Management Service
HA Resource Management Service:
• Manages per-resource FSM
• Persistence
• Concurrency/Back Pressure
• Common Business Logic
HA Provider (plugin):
• Resource-specific Business Logic
Resource
Goals
• Loose coupling between resources and HA
• Consolidate orthogonal HA concerns
• Prove the correct operation of the HA Resource Management Service and HA Providers independently
• Leverage CloudStack abstractions
• Develop a model for architectural evolution
To create a trustworthy system, operational correctness must be the prevailing priority
Terms and Concepts
• Health Check: An idempotent check of a resource to directly verify its proper operation
• Activity Check: An idempotent check to observe the side effects of a resource’s proper operation
• Eligibility: An idempotent determination of a resource’s eligibility for HA management
• Recovery: Take potentially destructive actions to bring a resource back to a healthy state
• Fence: Take potentially destructive actions to prevent an unrecoverable resource from impacting the health of its peers
States
• DISABLED: HA operations have been disabled for the resource or for the partition that contains it.
• INITIALIZING: The initial health and eligibility of the resource for HA management are currently being determined.
• AVAILABLE: The resource passed its most recent health check and its containing partition has an HA state of ACTIVE.
• INELIGIBLE: The resource's enclosing partition has an HA state of ACTIVE but its current state does not support HA check and/or recovery operations.
• SUSPECT: The resource is pending an activity check because it failed its most recent health check.
• CHECKING: An activity check is currently being performed on the resource.
• RECOVERING: Recovery operations are in progress to bring the resource back to a healthy state.
• DEGRADED: The resource cannot be managed by the control plane but passed its most recent activity check, indicating that the resource is still servicing end-user requests.
• FENCED: The resource is not operating normally and automated attempts to recover it failed. Manual operator intervention is required to recover the resource.
State Model
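The state diagram itself is not reproduced in the slide text. As a rough illustration only, the sketch below encodes one plausible set of transitions inferred from the state descriptions above. The event names, the class `HaStateSketch`, and every transition (in particular where a resource goes after a successful recovery) are assumptions for illustration, not the transitions from the talk or from CloudStack.

```java
/** HA states as listed on the States slide. */
enum HaState {
    DISABLED, INITIALIZING, AVAILABLE, INELIGIBLE, SUSPECT,
    CHECKING, RECOVERING, DEGRADED, FENCED
}

/** Hypothetical events; these names do not come from the talk. */
enum HaEvent {
    ELIGIBLE_AND_HEALTHY, NOT_ELIGIBLE,
    HEALTH_CHECK_PASSED, HEALTH_CHECK_FAILED,
    ACTIVITY_CHECK_STARTED, ACTIVITY_FOUND, NO_ACTIVITY,
    RECOVERY_SUCCEEDED, RECOVERY_FAILED
}

final class HaStateSketch {
    private HaStateSketch() {}

    /** Returns the next state, or throws if the event is not valid in the current state. */
    static HaState next(HaState current, HaEvent event) {
        switch (current) {
            case INITIALIZING:
                if (event == HaEvent.ELIGIBLE_AND_HEALTHY) return HaState.AVAILABLE;
                if (event == HaEvent.NOT_ELIGIBLE) return HaState.INELIGIBLE;
                break;
            case AVAILABLE:
                if (event == HaEvent.HEALTH_CHECK_PASSED) return HaState.AVAILABLE;
                // A failed health check only raises suspicion; nothing destructive yet.
                if (event == HaEvent.HEALTH_CHECK_FAILED) return HaState.SUSPECT;
                break;
            case SUSPECT:
                if (event == HaEvent.ACTIVITY_CHECK_STARTED) return HaState.CHECKING;
                break;
            case CHECKING:
                // Activity without control-plane health: still serving users.
                if (event == HaEvent.ACTIVITY_FOUND) return HaState.DEGRADED;
                if (event == HaEvent.NO_ACTIVITY) return HaState.RECOVERING;
                break;
            case RECOVERING:
                // Assumption: a recovered resource is re-evaluated from scratch.
                if (event == HaEvent.RECOVERY_SUCCEEDED) return HaState.INITIALIZING;
                if (event == HaEvent.RECOVERY_FAILED) return HaState.FENCED;
                break;
            case DEGRADED:
                if (event == HaEvent.HEALTH_CHECK_PASSED) return HaState.AVAILABLE;
                break;
            default:
                // DISABLED, INELIGIBLE, and FENCED have no automatic exits in this sketch.
                break;
        }
        throw new IllegalStateException(event + " is not valid in state " + current);
    }
}
```

Keeping the transition function pure (no I/O, no side effects) is one way to make the state logic testable independently of any provider, which matches the stated goal of proving the service and the providers correct separately.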
HA Provider Interface
public interface HAProvider<R> extends Adapter {
    // Resource type and subtype this provider handles
    ResourceType resourceType();
    ResourceSubType resourceSubType();
    // Eligibility: can this resource be managed by HA?
    boolean isEligible(R r);
    // Health check: directly verify proper operation
    boolean isHealthy(R r) throws HACheckerException;
    // Activity check: observe side effects of proper operation
    boolean hasActivity(R r) throws HACheckerException;
    // Potentially destructive operations
    boolean recover(R r) throws HARecoveryException;
    boolean fence(R r) throws HAFenceException;
}
KVM Host HA
KVM Host HA Provider
• KVM Agent: Health Check
• Storage Processor: Activity Check
• Host: Recover / Fence using OOBM
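To make the division of labor on this slide concrete, here is a minimal sketch of a provider that delegates each operation to a separate collaborator: the agent for health, storage for activity, and out-of-band management (OOBM) for recovery and fencing. Everything in it is hypothetical: `SketchHaProvider` is a simplified stand-in for the `HAProvider` interface (type methods and checked exceptions omitted), `OobmClientSketch` is an invented interface, and mapping recover to a power cycle and fence to a power-off is an assumption rather than something the slides state.

```java
import java.util.function.Predicate;

/** Simplified stand-in for HAProvider; not the real CloudStack interface. */
interface SketchHaProvider<R> {
    boolean isEligible(R r);
    boolean isHealthy(R r);
    boolean hasActivity(R r);
    boolean recover(R r);
    boolean fence(R r);
}

/** Hypothetical out-of-band management client (for example, power control). */
interface OobmClientSketch<R> {
    boolean powerCycle(R host);
    boolean powerOff(R host);
}

final class KvmHostHaSketch<R> implements SketchHaProvider<R> {
    private final Predicate<R> eligibility;          // assumed eligibility rule
    private final Predicate<R> agentHealthCheck;     // asks the KVM agent directly
    private final Predicate<R> storageActivityCheck; // looks for side effects on storage
    private final OobmClientSketch<R> oobm;

    KvmHostHaSketch(Predicate<R> eligibility,
                    Predicate<R> agentHealthCheck,
                    Predicate<R> storageActivityCheck,
                    OobmClientSketch<R> oobm) {
        this.eligibility = eligibility;
        this.agentHealthCheck = agentHealthCheck;
        this.storageActivityCheck = storageActivityCheck;
        this.oobm = oobm;
    }

    @Override public boolean isEligible(R host) { return eligibility.test(host); }

    // Idempotent checks: safe to repeat, no side effects on the host.
    @Override public boolean isHealthy(R host) { return agentHealthCheck.test(host); }
    @Override public boolean hasActivity(R host) { return storageActivityCheck.test(host); }

    // Potentially destructive operations go through OOBM (assumed mapping).
    @Override public boolean recover(R host) { return oobm.powerCycle(host); }
    @Override public boolean fence(R host) { return oobm.powerOff(host); }
}
```

Because the checks are injected, a test can simulate "agent unreachable but storage still active" without any real host, which is the situation the DEGRADED state describes.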
Concurrency Model
• Producer/consumer model
• Size-bounded work queues
• Time-bounded operations
• Fixed-size thread pools
• Idempotent operations are ephemeral
• Non-idempotent operations are managed through AsyncJobManager using a new time-delayed dispatcher
HA operations cannot overwhelm the control plane
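The first four bullets can be illustrated with standard `java.util.concurrent` classes. The sketch below is not the CloudStack implementation; the class name `BoundedHaExecutorSketch` and its methods are invented for illustration, and it covers only the ephemeral, idempotent checks (it does not model AsyncJobManager or the time-delayed dispatcher). It combines a fixed-size pool, a size-bounded queue that rejects work when full, and a per-operation time limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class BoundedHaExecutorSketch {
    private final ThreadPoolExecutor pool;
    private final long timeoutMillis;

    BoundedHaExecutorSketch(int threads, int queueCapacity, long timeoutMillis) {
        // Fixed-size pool (core == max) fed by a size-bounded queue. AbortPolicy
        // rejects new work when the queue is full instead of buffering without limit.
        this.pool = new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity), new ThreadPoolExecutor.AbortPolicy());
        this.timeoutMillis = timeoutMillis;
    }

    /** Producer side: returns null when the queue is full so the caller can back off. */
    Future<Boolean> trySubmit(Callable<Boolean> check) {
        try {
            return pool.submit(check);
        } catch (RejectedExecutionException queueFull) {
            return null;
        }
    }

    /** Waits at most timeoutMillis; a timed-out or failed check is cancelled and reported as false. */
    boolean await(Future<Boolean> pending) {
        try {
            return Boolean.TRUE.equals(pending.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException | ExecutionException e) {
            pending.cancel(true);
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            pending.cancel(true);
            return false;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Rejecting work at submission time is what keeps the queue bounded: a producer that receives `null` can simply skip this round, since an idempotent check can be retried on the next cycle without harm.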
Status
• Focused on KVM host HA
• Initial implementation started: validating the design
• Draft specification: functional spec will be published in the next 1-2 weeks
• Robust unit and integration test model to verify both the service and KVM host HA provider
• Delivery of the first version in July 2016 for inclusion in 4.10 (August 2016)
What’s Next
• Support Nested HA Resources
• Instrumentation
• Migrate VM HA to the HA Resource Management Service
Questions? Comments?
#cloudstackworks

Editor's Notes

  1. Goals: provide an overview of the HA Resource Management Service, which will be delivered in 4.10; explain how developers can leverage it to add HA management for their plugins/resources. Caveats: 20-minute talk, a high-level overview of a large, complex topic.
  2. We started working on fixing KVM host HA, and discovered that we could build something useful for other types of resources. How many people have pulled an HA system out of production because you couldn’t trust it?
  3. Considered integrating the Linux-HA project
  4. Runs side-by-side with the existing HA mechanism
  5. HA is an aspect of sorts applied to a resource. Separate tables for persistence of HA state without impacting the core schema. New APIs to configure and control HA for a resource.
  6. Breeze through this slide — provided for reference
  7. These transitions are completely encapsulated
  8. Implemented per resource type/plugin. Dynamically discovered at startup via reflection.
  9. Activity check does not require clock sync. Only NFS now, but any storage provider that adds the capability can be leveraged.
  10. Handoff of HA operations in clustered configurations