The Cloud Specialists
Reliable Host Fencing In CloudStack
Rohit Yadav (Software Architect)
Boris Stoyanov (Sr. Software Test Engineer)
rohit.yadav@shapeblue.com
boris.stoyanov@shapeblue.com
@rhtyd / @bsstoyanov
ShapeBlue.com @ShapeBlue
About Me
Rohit Yadav
• Software Architect @ ShapeBlue
• Contributor and Committer since 2012
• Author and maintainer of CloudMonkey
Boris Stoyanov
• Senior Software Test Engineer @ ShapeBlue
• Contributor since 2016
About ShapeBlue
“ShapeBlue are expert builders of public & private clouds. They are the leading global CloudStack services company.”
ShapeBlue customers
What is HA?
High availability is a characteristic of a system which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. [Source: Wikipedia]
HA in CloudStack: Status Quo
• Currently, CloudStack supports HA only for VMs.
• The VM HA mechanism works only for VMs that are marked HA.
• The implementation is tied to the VM as a first-class resource, is asynchronously scheduled, and is limited to VM investigation, fencing, and restart on a new host.
HA in Production: Status Quo
• Investigations are VM-centric, not host-centric.
• Host fencing is limited and highly unreliable.
• VM HA may end up starting VMs on another host while the same VMs may still be running on the faulty host. Large environments see corrupt VMs and disks.
• Faulty hosts and faulty neighbors go unchecked, with no automatic recovery.
• These are real-world issues seen in a very large KVM environment.
Attempted Solutions: KVM
• Check the VM for disk activity, based on a timeout/threshold, before (re)starting the VM.
• (Wall) clocks are not reliable
• Maintenance and management issues
• No recovery mechanism; fencing still remains unreliable
References:
https://issues.apache.org/jira/browse/CLOUDSTACK-8762
https://github.com/apache/cloudstack/pull/753
Long Term Solution?
• CloudStack needs a way to perform power management tasks for hosts
• Solve issues of corrupt disks due to VM HA and unreliable host fencing
• Improve experience for admins: granular configuration, feature kill-switch, maintenance, management, reporting, alerts, investigations, reliable fencing and recovery, etc.
Host Power Management for CloudStack
• Implemented a pluggable out-of-band management framework for CloudStack
• Granular configuration per host, kill switch at zone/cluster/host level
• Default plugin for IPMI 2.0-compliant hosts to support power operations: on, off, reboot, shutdown, status, etc.
• High quality tests, end-to-end testing based on ipmisim
• DIY oobm plugin
Reference:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Out-of-band+Management+for+CloudStack
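To make the idea of a pluggable power-management driver concrete, here is a minimal Python sketch. CloudStack itself is a Java project, so this is not its API: `PowerOperation`, `OobmDriver`, and `FakeDriver` are hypothetical names chosen for illustration, and only the operation names (on, off, reboot, shutdown, status) come from the slide.

```python
from abc import ABC, abstractmethod
from enum import Enum


class PowerOperation(Enum):
    # Operation names taken from the slide; everything else is hypothetical.
    ON = "on"
    OFF = "off"
    REBOOT = "reboot"
    SHUTDOWN = "shutdown"
    STATUS = "status"


class OobmDriver(ABC):
    """Hypothetical plugin interface: one driver per management protocol."""

    @abstractmethod
    def execute(self, host: str, op: PowerOperation) -> str:
        """Run a power operation and return the resulting power state."""


class FakeDriver(OobmDriver):
    """In-memory stand-in for a real driver, for illustration only."""

    def __init__(self) -> None:
        self.power = {}  # host name -> "on" or "off"

    def execute(self, host: str, op: PowerOperation) -> str:
        if op in (PowerOperation.ON, PowerOperation.REBOOT):
            self.power[host] = "on"
        elif op in (PowerOperation.OFF, PowerOperation.SHUTDOWN):
            self.power[host] = "off"
        # STATUS leaves the state untouched; unknown hosts report "off".
        return self.power.get(host, "off")
```

A framework built this way only ever calls `execute`, so a "DIY" plugin would amount to another subclass of the same interface.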
Building Blocks for Host HA
• Reliably fence/recover a host: use the new out-of-band management subsystem
• What's missing:
  • Granular HA configuration
  • Host HA kill-switch: at zone/cluster/host level
  • Tuning: threshold-based investigation, activity checks, timeouts, etc.
  • Task/load management, circuit breakers, constraint-based state transitions and operations
Reference:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
Rethink HA
• CloudStack organization units as partitions: Zone, Pod, Cluster, Host, VM.
• Separate policy from mechanism: implement framework/managers to enforce policies, have plugins to carry out mechanisms.
• Define HA for a general resource, pluggable HA provider implementations.
• Operational simplicity.
• Granular configuration, kill-switch at zone/cluster/host level. Disabled by default.
• Threshold-based investigations, checking, fencing and recovery.
• Leverage existing abstractions.
• Integrated resource management.
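The split between policy and mechanism can be sketched roughly as follows. This is an illustrative Python sketch, not CloudStack code: `HAProvider`, `HAManager`, `is_healthy`, and `failure_threshold` are hypothetical names, and the rule "move to Suspect after N consecutive failed checks" is an assumed policy used only to show where a threshold would live.

```python
from typing import Protocol


class HAProvider(Protocol):
    """Mechanism: how to check one kind of resource (hypothetical interface)."""

    def is_healthy(self, resource: str) -> bool: ...


class HAManager:
    """Policy: how many failed checks are tolerated before acting."""

    def __init__(self, provider: HAProvider, failure_threshold: int) -> None:
        self.provider = provider
        self.failure_threshold = failure_threshold
        self.failures = {}  # resource -> consecutive failed checks

    def poll(self, resource: str) -> str:
        """Run one check and return the resulting (assumed) state name."""
        if self.provider.is_healthy(resource):
            self.failures[resource] = 0
            return "Available"
        self.failures[resource] = self.failures.get(resource, 0) + 1
        if self.failures[resource] >= self.failure_threshold:
            return "Suspect"
        return "Available"
```

The manager never needs to know how a check is performed, so a different resource type only requires a different provider.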
Host HA: Design and Implementation
• HA Resource Management Service
  • HA resource lifecycle management
  • HA resource type agnostic
  • Disabled by default, granular configurations, zone/cluster/host kill-switch, tuning
• HA Provider
  • Resource specific HA plugin
  • Defines partition and resource type
  • DIY HA provider for partition: host, hypervisor, etc.
  • One HA provider per resource type, per partition
Reference:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
Host HA: FSM States Explained
• HA Resource FSM States
• Available
• Suspect
• Checking
• Degraded
• Recovering, Recovered
• Fencing, Fenced
• Disabled
• Ineligible
Host HA: FSM State Transitions
Reference:
https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
Host HA: Lifecycle management
• Granular HA configuration
• Kill switch: enable/disable for a partition (zone/cluster/host)
• HA validation and ownership management
• New Background Polling Manager for executor service management
• Task executor, bounded (ephemeral) queue management
• HA polling tasks: Health Checks, Activity Checks, Recovery Task and Fence Task
• FSM transitions based on task execution result
• HA resource counter management: track investigation rounds, thresholds, timestamps, recover/fence operations
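One way to picture the kill switch is as a check that walks the partition hierarchy. The Python sketch below is an assumption about the semantics, not the actual implementation: `HaConfig` and `ha_is_active` are hypothetical names, hosts are opt-in because the slides state that the feature is disabled by default, and a disabled zone or cluster is assumed to override the per-host setting.

```python
from dataclasses import dataclass, field


@dataclass
class HaConfig:
    # Per-host opt-in: the slides say Host HA is disabled by default.
    enabled_hosts: set = field(default_factory=set)
    # Kill switches: partitions where HA has been explicitly turned off.
    disabled_zones: set = field(default_factory=set)
    disabled_clusters: set = field(default_factory=set)


def ha_is_active(cfg: HaConfig, zone: str, cluster: str, host: str) -> bool:
    """Assumed semantics: a disabled zone or cluster overrides the host setting."""
    if zone in cfg.disabled_zones or cluster in cfg.disabled_clusters:
        return False
    return host in cfg.enabled_hosts
```

Under this reading, an operator could stop all HA activity in a cluster with one switch without touching the configuration of individual hosts.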
Host HA: KVM HA Provider
• STONITH (Shoot The Other Node In The Head) fencing model
• Activity check operations: checks for disk access activity on NFS storage
• Configurable activity check interval and activity checks
• Tunable timeouts and thresholds
• Request-reply model to run activity checks via adjacent eligible and healthy host(s)
• Uses the out-of-band management subsystem to carry out recover and fence operations
• Recovery is attempted before fencing of the host
• Alerting and reporting of operations
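The "recover first, fence as a last resort" ordering can be sketched as below. This is a simplified Python illustration under stated assumptions: `recover_or_fence` and `max_recovery_attempts` are hypothetical names, the two callables stand in for out-of-band operations, and the returned strings reuse state names from the slides without claiming to reproduce the real transition rules.

```python
from typing import Callable


def recover_or_fence(
    recover: Callable[[], bool],
    fence: Callable[[], bool],
    max_recovery_attempts: int = 2,
) -> str:
    """Try recovery up to a threshold, then fall back to fencing.

    Returns one of the state names from the slides: "Recovered",
    "Fenced", or "Fencing" (the fence operation has not succeeded yet).
    """
    for _ in range(max_recovery_attempts):
        if recover():  # e.g. an out-of-band reboot that brings the host back
            return "Recovered"
    # Recovery threshold exhausted: power the host off out-of-band.
    return "Fenced" if fence() else "Fencing"
```

Bounding the number of recovery attempts keeps a permanently broken host from being rebooted forever before it is fenced.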
Host HA: VM HA – HA Provider Coordination
• Remaps the VM-HA host state returned to the VM-HA framework based on Host HA states, only for hosts with Host HA enabled.
• For Host HA to work effectively, the existing VM HA framework needs to work in tandem with Host HA.
• By default Host HA is disabled; no explicit configuration changes are needed for existing users pre/post upgrade.
• Currently done for the KVM HAProvider.

Host HA state (KVM)            VM-HA host state returned
Available                      Up
Suspect/Checking               Up (Investigating)
Degraded                       Alert
Recovering/Recovered/Fencing   Disconnected
Fenced                         Down
Ineligible/Disabled            --
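The state remapping on this slide amounts to a lookup table. A minimal Python sketch follows; the function name `vm_ha_host_state` is hypothetical, the mapping values are copied from the slide, and representing the "--" row as `None` (meaning no remapped state is returned) is an assumption.

```python
from typing import Optional

# Mapping copied from the slide's table; None stands for the "--" row.
HOST_HA_TO_VM_HA = {
    "Available": "Up",
    "Suspect": "Up (Investigating)",
    "Checking": "Up (Investigating)",
    "Degraded": "Alert",
    "Recovering": "Disconnected",
    "Recovered": "Disconnected",
    "Fencing": "Disconnected",
    "Fenced": "Down",
    "Ineligible": None,
    "Disabled": None,
}


def vm_ha_host_state(host_ha_state: str) -> Optional[str]:
    """Return the host state reported to VM HA for a given Host HA state.

    Raises KeyError for a state name that is not in the slide's table.
    """
    return HOST_HA_TO_VM_HA[host_ha_state]
```

Only "Fenced" maps to "Down", which reflects the point of the talk: VM HA should treat a host as down only once it has been reliably fenced.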
Host HA: Testing with Simulator HA Provider
• The HA Provider for the Simulator provides the means and instrumentation to perform end-to-end deterministic testing of the framework.
• It provides a means of validating the feature and shows the pluggability of the framework.
• New Simulator APIs provide a means of validating FSM sequences and instrumenting internal data structures.
• Marvin-based integration tests cover FSM transitions, HA operations, validations, configurations, and HA ownership.
Host HA: Testing in nested CloudStack environment
• Recently, nested CloudStack environments such as Trillian and Bubble have helped tremendously with QA efforts. In such environments, hypervisor hosts are VMs in another CloudStack environment.
• As part of the FR, we've implemented a new out-of-band management plugin for nested CloudStack environments.
• This plugin can perform power management operations to start/stop/reboot the host VMs.
• The new oobm plugin allows for scalability and load testing of the Host HA feature in a nested CloudStack environment. It is currently being tested for a large KVM-based environment.
Host HA: Current State & Future Plans
• Pull request: https://github.com/apache/cloudstack/pull/1960
• FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
• Currently supports two HA Provider implementations:
  • KVM: Out-of-band management, NFS supported
  • Simulator: QA/testing
• Available out-of-band management plugins: ipmitool and nested-cloudstack
• Likely available in Apache CloudStack 4.11 or above
• Future Plans:
  • Multiple HA Provider implementations for other hypervisors, support for other storage
  • Scope for extension to support HA for other resources/partitions
Host HA: Thanks & Credits
• Abhinandan Prateek: KVM HA Provider implementation
• Boris Stoyanov: Reviews and QA
• Ilya Musayev, Marcus Sorensen and John Burwell: Requirements, feedback and design
• Rohit Yadav: Overall design and implementation
• Team ShapeBlue, Paul, Dag, Daan: Reviews, discussions, testing, Trillian setups
Q&A
• Comments, questions welcome!
• Discuss on dev ML or on the PR.

CCCNA17 Reliable Host Fencing

  • 1. The Cloud Specialists Reliable Host Fencing In CloudStack Rohit Yadav (Software Architect) Boris Stoyanov (Sr. Software Test Engineer) rohit.yadav@shapeblue.com boris.stoyanov@shapeblue.com @rhtyd / @bsstoyanov
  • 2. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue A b o u t M e Rohit Yadav • Software Architect @ ShapeBlue • Contributor and Committer since 2012 • Author and maintainer of CloudMonkey Boris Stoyanov • Senior Software Engineer Test @ ShapeBlue • Contributor since 2016
  • 3. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue “ShapeBlue are expert builders of public & private clouds. They are the leading global CloudStack services company.” A b o u t S h a p e B l u e
  • 4. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue S h a p e B l u e c u s t o m e r s
  • 5. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue S h a p e B l u e c u s t o m e r s
  • 6. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue S h a p e B l u e c u s t o m e r s
  • 7. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue W h a t i s H A ? High availability is a characteristic of a system, which aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period.[source: wikipedia]
  • 8. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue H A i n C l o u d S t a c k : S t a t u s Q u o • Currently HA is only supported for VMs by CloudStack. • VM HA mechanism works for VMs that are marked HA. • Implementation tied to VM as a first class resource, asynchronously scheduled, limited to VM investigation/fencing/restart on new host.
  • 9. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue H A i n P r o d u c t i o n : S t a t u s Q u o • Investigations are VM centric and not host centric. • Limited fencing of host, highly unreliable. • VM HA may end up starting VMs on another host, while the VMs may be running on the faulty. Large environments see corrupt VMs and disks. • Unchecked faulty hosts and faulty neighbors, with no automatic-recovery. • Real world issues seen in a very large KVM environment.
  • 10. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue A t t e m p t e d S o l u t i o n s : K V M • Check VM for disk activities based on a timeout/threshold before re/starting VM. • (Wall) Clocks are not reliable • Maintenance and management issues • No recovery mechanism, fencing still remains unreliable References: https://issues.apache.org/jira/browse/CLOUDSTACK-8762 https://github.com/apache/cloudstack/pull/753
  • 11. C l i c k t o e d i t The Cloud Specialists ShapeBlue.com @ShapeBlue L o n g Te r m S o l u t i o n ? • CloudStack needs a way to perform power management tasks for hosts • Solve issues of corrupt disks due to VM HA and unreliable host fencing • Improve experience for admins: granular configuration, feature kill-switch, maintenance, management, reporting, alerts, investigations, reliable fencing and recovery etc.
  • 12. Host Power Management for CloudStack
      • Implemented a pluggable out-of-band management framework for CloudStack.
      • Granular configuration per host, kill switch at zone/cluster/host level.
      • Default plugin for IPMI 2.0 compliant hosts to support power operations: on, off, reboot, shutdown, status, etc.
      • High quality tests, end-to-end testing based on ipmisim.
      • DIY oobm plugin.
      Reference: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Out-of-band+Management+for+CloudStack
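The pluggable driver and the layered kill switch described on this slide can be pictured with a minimal Python sketch. Everything in it is hypothetical: `PowerOp`, `OobmSettings`, `InMemoryDriver` and `power_action` are illustrative names, not CloudStack classes, and the rule that disabling any level disables everything beneath it is an assumption about how the zone/cluster/host switch combines.

```python
from dataclasses import dataclass
from enum import Enum


class PowerOp(Enum):
    # Power operations named on the slide.
    ON = "on"
    OFF = "off"
    REBOOT = "reboot"
    SHUTDOWN = "shutdown"
    STATUS = "status"


@dataclass
class OobmSettings:
    # Hypothetical flags: one kill switch per level.
    zone_enabled: bool = True
    cluster_enabled: bool = True
    host_enabled: bool = True

    def is_enabled(self) -> bool:
        # Assumed semantics: disabling any level disables everything below it.
        return self.zone_enabled and self.cluster_enabled and self.host_enabled


class InMemoryDriver:
    """Stand-in plugin: tracks power state in a dict instead of calling a BMC."""

    def __init__(self):
        self._powered_on = {}

    def execute(self, host: str, op: PowerOp) -> str:
        if op in (PowerOp.ON, PowerOp.REBOOT):
            self._powered_on[host] = True
        elif op in (PowerOp.OFF, PowerOp.SHUTDOWN):
            self._powered_on[host] = False
        # Every operation, including STATUS, reports the resulting power state.
        return "on" if self._powered_on.get(host, False) else "off"


def power_action(driver, settings: OobmSettings, host: str, op: PowerOp) -> str:
    """Run a power operation through the driver unless a kill switch is off."""
    if not settings.is_enabled():
        raise PermissionError(f"out-of-band management is disabled for {host}")
    return driver.execute(host, op)
```

A real plugin would replace `InMemoryDriver` with something that talks to the host's management controller; the framework-side check in `power_action` would stay the same.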
  • 13. Building Blocks for Host HA
      • To reliably fence/recover a host: use the new shiny out-of-band management subsystem.
      • What's missing:
          • Granular HA configuration
          • Host HA kill-switch: at zone/cluster/host level
          • Tuning: threshold-based investigation, activity checks, timeouts, etc.
          • Task/load management, circuit breakers, constraint-based state transitions and operations
      Reference: https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+HA+with+IPMI+Fencing
  • 14. Rethink HA
      • CloudStack organization units as partitions: Zone, Pod, Cluster, Host, VM.
      • Separate policy from mechanism: implement framework/managers to enforce policies, have plugins to carry out mechanisms.
      • Define HA for a general resource, with pluggable HA provider implementations.
      • Operational simplicity.
      • Granular configuration, kill-switch at zone/cluster/host level. Disabled by default.
      • Threshold-based investigations, checking, fencing and recovery.
      • Leverage existing abstractions.
      • Integrated resource management.
  • 15. Host HA: Design and Implementation
      • HA Resource Management Service
          • HA resource lifecycle management
          • HA resource type agnostic
          • Disabled by default, granular configurations, zone/cluster/host kill-switch, tuning
      • HA Provider
          • Resource-specific HA plugin
          • Defines partition and resource type
          • DIY HA provider for partition: host/hypervisor/etc.
          • One HA provider per resource type, per partition
      Reference: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
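As a rough illustration of the "one HA provider per resource type, per partition" rule, the sketch below keeps providers in a registry keyed by that pair and rejects a second registration for the same key. The class and method names (`HAProvider`, `HAProviderRegistry`, `is_healthy`, `recover`, `fence`) and the example "Host"/"Simulator" values are assumptions made for this example; the actual framework is written in Java and its interfaces may differ.

```python
from abc import ABC, abstractmethod


class HAProvider(ABC):
    """Mechanism side of the design: resource-specific checks and operations."""

    resource_type: str   # hypothetical example value: "Host"
    partition_type: str  # hypothetical example value: "KVM"

    @abstractmethod
    def is_healthy(self, resource: str) -> bool: ...

    @abstractmethod
    def recover(self, resource: str) -> bool: ...

    @abstractmethod
    def fence(self, resource: str) -> bool: ...


class HAProviderRegistry:
    """Policy side: allows at most one provider per (resource type, partition)."""

    def __init__(self):
        self._providers = {}

    def register(self, provider: HAProvider) -> None:
        key = (provider.resource_type, provider.partition_type)
        if key in self._providers:
            raise ValueError(f"a provider is already registered for {key}")
        self._providers[key] = provider

    def lookup(self, resource_type: str, partition_type: str) -> HAProvider:
        return self._providers[(resource_type, partition_type)]


class AlwaysHealthyProvider(HAProvider):
    """Trivial provider used only to exercise the registry."""

    resource_type = "Host"
    partition_type = "Simulator"

    def is_healthy(self, resource: str) -> bool:
        return True

    def recover(self, resource: str) -> bool:
        return True

    def fence(self, resource: str) -> bool:
        return True
```

Keeping the registry ignorant of what a provider actually does is one way to read "separate policy from mechanism": the manager decides when to check, recover or fence, and the provider decides how.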
  • 16. Host HA: FSM States Explained
      • HA Resource FSM States
          • Available
          • Suspect
          • Checking
          • Degraded
          • Recovering, Recovered
          • Fencing, Fenced
          • Disabled
          • Ineligible
  • 17. Host HA: FSM State Transitions
      Reference: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
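The transcript does not carry the transition diagram itself, so the following Python sketch only shows the general shape of a table-driven FSM over the states listed on the previous slide. The event names and the specific transitions are assumptions chosen for illustration and should be checked against the referenced design page before being relied on.

```python
from enum import Enum, auto


class HAState(Enum):
    # State names taken from the "FSM States Explained" slide.
    AVAILABLE = auto()
    SUSPECT = auto()
    CHECKING = auto()
    DEGRADED = auto()
    RECOVERING = auto()
    RECOVERED = auto()
    FENCING = auto()
    FENCED = auto()
    DISABLED = auto()
    INELIGIBLE = auto()


# Illustrative (state, event) -> next state table; events are hypothetical.
TRANSITIONS = {
    (HAState.AVAILABLE, "health_check_failed"): HAState.SUSPECT,
    (HAState.SUSPECT, "health_check_passed"): HAState.AVAILABLE,
    (HAState.SUSPECT, "failure_threshold_reached"): HAState.CHECKING,
    (HAState.CHECKING, "activity_seen"): HAState.DEGRADED,
    (HAState.CHECKING, "no_activity"): HAState.RECOVERING,
    (HAState.DEGRADED, "health_check_passed"): HAState.AVAILABLE,
    (HAState.RECOVERING, "recovery_succeeded"): HAState.RECOVERED,
    (HAState.RECOVERING, "recovery_failed"): HAState.FENCING,
    (HAState.RECOVERED, "health_check_passed"): HAState.AVAILABLE,
    (HAState.FENCING, "fence_succeeded"): HAState.FENCED,
}


def next_state(state: HAState, event: str) -> HAState:
    """Return the next state, refusing any transition not in the table."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event!r}") from None
```

Rejecting unlisted transitions outright is one simple way to get the "constraint based state transitions" mentioned earlier in the deck.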
  • 18. Host HA: Lifecycle Management
      • Granular HA configuration
      • Kill switch: enable/disable for a partition (zone/cluster/host)
      • HA validation and ownership management
      • New Background Polling Manager for executor service management
      • Task executor, bounded (ephemeral) queue management
      • HA polling tasks: Health Checks, Activity Checks, Recovery Task and Fence Task
      • FSM transitions based on task execution result
      • HA resource counter management: track investigation rounds, thresholds, timestamps, recover/fence operations
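One possible reading of "bounded (ephemeral) queue" is a fixed-capacity task queue that refuses new work when full instead of growing without limit, on the assumption that the next polling round will simply submit the task again. The Python sketch below shows that idea with the standard library; `BoundedTaskQueue` and its methods are hypothetical names, and the real polling manager is a multi-threaded Java executor rather than this single-threaded loop.

```python
import queue


class BoundedTaskQueue:
    """Fixed-capacity queue of zero-argument callables (polling tasks)."""

    def __init__(self, max_pending: int):
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, task) -> bool:
        """Queue a task; return False (task dropped) when the queue is full."""
        try:
            self._queue.put_nowait(task)
            return True
        except queue.Full:
            return False

    def run_pending(self) -> list:
        """Run all queued tasks in FIFO order and return their results."""
        results = []
        while True:
            try:
                task = self._queue.get_nowait()
            except queue.Empty:
                return results
            results.append(task())
```

The results returned by `run_pending` stand in for the task execution results that the slide says drive FSM transitions.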
  • 19. Host HA: KVM HA Provider
      • STONITH (Shoot The Other Node In The Head) fencing model
      • Activity check operations: checks for disk access activity on NFS storage
      • Configurable activity check interval and activity checks
      • Tunable timeouts and thresholds
      • Request-reply model to run activity checks via adjacent eligible and healthy host(s)
      • Uses the out-of-band management subsystem to carry out recover and fence operations
      • Recovery is attempted before fencing of the host
      • Alerting and reporting of operations
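The "recovery is attempted before fencing" rule can be sketched as a small function that retries a recover operation a bounded number of times and only then fences. This is a hedged illustration: `recover_or_fence`, `ScriptedPower`, and the choice of reboot for recovery and power-off for fencing are assumptions for the example, not the KVM HA provider's actual code.

```python
def recover_or_fence(power, host: str, max_recovery_attempts: int) -> str:
    """Try to recover a host first; fence it only if every attempt fails.

    `power` is any object with reboot(host) -> bool and power_off(host) -> bool,
    standing in for the out-of-band management subsystem.
    """
    for _ in range(max_recovery_attempts):
        if power.reboot(host):
            return "Recovered"
    # All recovery attempts failed: power the host off so its VMs can be
    # restarted elsewhere without two copies touching the same disks.
    return "Fenced" if power.power_off(host) else "Fencing"


class ScriptedPower:
    """Test double that records calls and returns a fixed reboot outcome."""

    def __init__(self, reboot_succeeds: bool):
        self.reboot_succeeds = reboot_succeeds
        self.calls = []

    def reboot(self, host: str) -> bool:
        self.calls.append("reboot")
        return self.reboot_succeeds

    def power_off(self, host: str) -> bool:
        self.calls.append("power_off")
        return True
```

Returning "Fencing" when the power-off itself fails leaves the host in an in-progress state rather than claiming it is safely fenced, which matters because VM restarts elsewhere are only safe once fencing has actually succeeded.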
  • 20. Host HA: VM HA - HA Provider Coordination
      • Remaps the host state returned to the VM-HA framework based on Host HA states, only for hosts with Host HA enabled.
      • For Host HA to work effectively, the existing VM HA framework needs to work in tandem with Host HA.
      • By default Host HA is disabled; no explicit configuration changes are needed for existing users pre/post upgrade.
      • Currently done for the KVM HAProvider.

      Host HA state (KVM)             VM-HA host state returned
      Available                       Up
      Suspect/Checking                Up (Investigating)
      Degraded                        Alert
      Recovering/Recovered/Fencing    Disconnected
      Fenced                          Down
      Ineligible/Disabled             --
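The remapping table on this slide translates directly into a lookup. The Python sketch below copies the rows as given; treating the "--" row (Ineligible/Disabled) as "no remapping, keep whatever state VM HA would otherwise see" is an assumption, as are the function and variable names.

```python
# Rows copied from the slide's table; grouped states are listed individually.
HOST_HA_TO_VM_HA = {
    "Available": "Up",
    "Suspect": "Up (Investigating)",
    "Checking": "Up (Investigating)",
    "Degraded": "Alert",
    "Recovering": "Disconnected",
    "Recovered": "Disconnected",
    "Fencing": "Disconnected",
    "Fenced": "Down",
}


def vm_ha_host_state(host_ha_enabled: bool, host_ha_state: str, original_state: str) -> str:
    """Return the host state that the VM-HA framework should see."""
    if not host_ha_enabled:
        # Remapping applies only to hosts with Host HA enabled.
        return original_state
    # Assumption: states without a mapping ("--" in the table) are not remapped.
    return HOST_HA_TO_VM_HA.get(host_ha_state, original_state)
```

Under this mapping, VM HA sees "Down" only once Host HA has reached Fenced, which is how the two frameworks avoid restarting VMs that may still be running on a faulty host.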
  • 21. Host HA: Testing with Simulator HA Provider
      • The HA Provider for Simulator provides the means and instrumentation to perform end-to-end deterministic testing of the framework.
      • It provides a means of validating the feature and shows the pluggability of the framework.
      • New Simulator APIs provide a means of validating FSM sequences and instrumenting internal data structures.
      • Marvin-based integration tests cover FSM transitions, HA operations, validations, configurations and HA ownership.
  • 22. Host HA: Testing in a Nested CloudStack Environment
      • Recently, nested CloudStack environments such as Trillian, Bubble, etc. have tremendously helped with QA efforts. In such environments, hypervisor hosts are VMs in another CloudStack environment.
      • As part of the FR, we've implemented a new out-of-band management plugin for nested CloudStack environments.
      • This plugin can perform power management operations to start/stop/reboot the host VMs.
      • The new oobm plugin allows for scalability and load testing of the Host HA feature in a nested CloudStack environment. It is currently being tested for a large KVM-based environment.
  • 23. Host HA: Current State & Future Plans
      • Pull request: https://github.com/apache/cloudstack/pull/1960
      • FS: https://cwiki.apache.org/confluence/display/CLOUDSTACK/Host+HA
      • Currently supports two HA Provider implementations:
          • KVM: Out-of-band management, NFS supported
          • Simulator: QA/testing
      • Available out-of-band management plugins: ipmitool and nested-cloudstack
      • Likely available in Apache CloudStack 4.11 or above
      • Future Plans:
          • Multiple HA Provider implementations for other hypervisors, support for other storage
          • Scope for extension to support HA for other resources/partitions
  • 24. Host HA: Thanks & Credits
      • Abhinandan Prateek: KVM HA Provider implementation
      • Boris Stoyanov: Reviews and QA
      • Ilya Musayev, Marcus Sorensen and John Burwell: Requirements, feedback and design
      • Rohit Yadav: Overall design and implementation
      • Team ShapeBlue, Paul, Dag, Daan: Reviews, discussions, testing, Trillian setups
  • 25. Q & A
      • Comments, questions welcome!
      • Discuss on dev ML or on the PR.