Good afternoon, folks. I hope you are here for the high availability discussion. In case of an emergency, we have specially arranged a highly available pair of exits, to your left and behind you. So, let me tell you a bit about myself and what HA means to me. I am a product manager at RightScale.
My relationship with HA goes back all the way to my kindergarten years, growing up in India. Going to my first big kindergarten exam, I recall worrying about having more than one sharpened pencil in my pencil box, ready to go. And yes, kindergarteners have exams in India, but that’s an entirely different discussion.
Fast forward to my college days, taking my first big flight on a 747 to California. Yes, you guessed it: I worried about the plane having enough engines so that if one of them failed, I wouldn’t become fish food in the Pacific Ocean.
Fast forward a few more years to my telecommunication days, visiting KDDI and NTT DoCoMo in Japan to discuss our messaging product. They pretty much immediately got to the topic of “How many 9s does your product do?” Anything less than five 9s would not have been an acceptable answer in the heavily regulated Japanese telecommunication market.
Here is a quick definition of how the “9s” of availability translate to allowed downtime each year.
Leap forward to 2012: the cloud era is in full swing. Behemoth cloud providers are stamping out VMs like Oreo cookies while preaching the mantra “everything fails all the time”. And rightfully so: in 2012, we saw 27 sizable outages across public, private, hosting and SaaS providers.
Infographic (not restricted to cloud computing only):
- 7 major cloud outages in 2012
- The average company has 1 major and 3 minor DC outages per year
- $5k per minute of downtime (average cost)
Outages are becoming more and more public as more people get on the cloud.
- May of 2010
- First big one that happened was in
- April 2011: a lot of people, and it got a lot of press
Among the top five causes of outages were power loss, natural disasters, software bugs that cascaded, and operator errors. Even though large-scale outages are rare, they do happen and will continue to happen in the future.
In the aftermath of outages, you see these. Outages are expensive: there is nothing more frustrating to a modern-day consumer than going to a website and seeing that it is down. Every minute of downtime affects your revenue and your brand reputation. Computer Associates did a study last year that put the cost of outages at about $26 billion a year. Cost of
We are in the golden age of cloud computing.
At the end of the day, you are responsible for the HA of your application. Cloud infrastructure provides tools. Relying on cloud infrastructure for HA is a recipe for trouble, as this locks you into that cloud infrastructure. You need portability, so that when you move your application to another cloud, it stands on its own merit.
Weigh the complexity of HA against the risk, as you would with auto and home insurance. The cost of HA goes up exponentially as you reduce your tolerance for downtime (recovery time objective) as well as your tolerance for data loss (recovery point objective).
This is what we generally recommend when someone comes to us and says “I want HA”:
- Three-tiered application
- RR DNS
- Load balancers
- Array of application servers
- Master-slave databases
- At least one of each component in each AZ
Place the slave database in a different zone, so that if one of the zones were to go down, you will not have an outage. Granted, there will be some performance degradation.
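The benefit of placing a component in more than one zone can be estimated with a little arithmetic. The sketch below assumes zones fail independently, which is optimistic given that cascading software bugs are among the outage causes mentioned earlier; the function name and the 99% input are illustrative, not figures from the talk.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Availability of a component deployed in `zones` zones.

    Assumes zone failures are independent and that one surviving
    zone is enough to keep serving traffic.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    # The component is down only when every zone is down at once.
    all_zones_down = (1 - zone_availability) ** zones
    return 1 - all_zones_down


if __name__ == "__main__":
    for zones in (1, 2, 3):
        print(zones, "zone(s):", f"{combined_availability(0.99, zones):.6f}")
```

Under these assumptions a second zone takes a 99% component to roughly 99.99%, which is why the recommendation is at least one of each component in each AZ.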
During emergencies, time is precious – make sure it works
If both go down, you have nowhere to go. If the disaster hits management, you still have the app; if the disaster hits the app, you can execute on DR scenarios.
Which parts you should automate and which parts you shouldn’t.
We always recommend using dynamic DNS for your DB servers. This allows app servers to use a single FQDN (e.g. mymaster.mydomain.com) that is resolved by the dynamic DNS. In case of a failover, the dynamic DNS is updated automatically and the servers discover the new DB once the TTL expires, so use a low TTL.
We recommend automating the process of connecting app servers to LBs, so that when a new app server fires up, it registers itself with the load balancer without manual intervention.
For promoting a slave to master, the process is automated but the decision to run the process is manual. Once you have pushed that button, there is no going back, so make sure you are certain before you fail over. The promotion happened in case where the master wasn’t really down but it resulted
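To illustrate why a low TTL matters, here is a minimal sketch of an app-side lookup cache that re-resolves the master's FQDN once the cached answer expires. The class name, the injected `resolve` callable and the `ttl_seconds` value are assumptions made for this example; they are not part of any RightScale or DNS library API.

```python
import time
from typing import Callable, Dict, Tuple


class TtlResolver:
    """Caches name lookups and refreshes them after `ttl_seconds`."""

    def __init__(self, resolve: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve      # function that maps an FQDN to an address
        self._ttl = ttl_seconds      # how long a cached answer stays valid
        self._clock = clock          # injectable so the behaviour can be tested
        self._cache: Dict[str, Tuple[str, float]] = {}

    def lookup(self, fqdn: str) -> str:
        now = self._clock()
        cached = self._cache.get(fqdn)
        if cached is not None and now < cached[1]:
            return cached[0]         # still within the TTL: reuse the old answer
        address = self._resolve(fqdn)
        self._cache[fqdn] = (address, now + self._ttl)
        return address
```

In a real deployment `resolve` could be `socket.gethostbyname`, with `ttl_seconds` set to match the record's TTL by configuration, since that call does not report the TTL. The point of the sketch is that app servers keep talking to the old master for up to one TTL after a failover, so a shorter TTL means a shorter window.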
I am representing RightScale today, so a little bit on how RightScale can help. ServerTemplates allow you to pre-configure servers by starting from a base image and adding scripts that run during the boot, operational and shutdown phases of a server instance. The key benefit of a ServerTemplate is that it helps you create an easily reproducible server setup, and this can be done across multiple clouds. Through the server configuration mechanism built into ServerTemplates, the servers can automatically join load balancer pools, autoscale across zones, etc.
A ServerTemplate contains a list of multi-cloud images. When a server is created, Quickly, efficiently and repeatably
Stacking up with OpenStack: Building for High Availability
Utpal Thakrar, Sr. Product Manager
April 17, 2013
2#My relationship with HA 1975 Cloud Management #rightscale
3#My relationship with HA 1991
4#My relationship with HA 2001
How many 9-s can your product do?
5#So what did they mean by 5-9s?
Availability    Allowed downtime each year
99%             3.65 days
99.9%           8.76 hours
99.99%          52.56 minutes
99.999%         5.26 minutes
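The figures on this slide follow from multiplying the unavailable fraction by the length of a year. A minimal sketch, assuming a 365-day year (the function name is illustrative, not from the talk):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the yearly downtime budget, in seconds, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * SECONDS_PER_YEAR


if __name__ == "__main__":
    for percent in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(percent) / 60
        print(f"{percent}% -> {minutes:.2f} minutes per year")
```

Each extra 9 divides the budget by ten, which is why going from four 9s to five 9s leaves only about five minutes a year for every failure and every maintenance window combined.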
6#Stuff happens, are you prepared?
10#Old School Fault-Tolerance: Build Two
11#Golden Age of Cloud Computing
• No Up-Front Capital Expense
• Low Cost
• Pay Only for What You Use
• Self-Service Infrastructure
• Easily Scale Up and Down
• Improve Agility & Time-to-Market
• Deploy
12#Golden Age for Fault-Tolerance
• No Up-Front HA Capital Expense
• Low Cost Backups
• Pay for DR Only When You Use it
• Self-Service DR Infrastructure
• Easily Deliver Fault-Tolerant Applications
• Improve Agility & Time-to-Recovery
• Deploy
13#Yeah, but … What about my private cloud?
Applications deployed in private clouds have to worry about:
• Private Cloud Infrastructure being HA
• Application architecture HA / DR
• With Public Clouds: well, you get what your provider gives you
14#Private Cloud Infrastructure HA
Several single points of failure in OpenStack deployment
• OpenStack API services
• MySQL
• RabbitMQ
Solved in various ways
• Pacemaker cluster management
• Keepalived (e.g. RAX Private Cloud)
• MySQL (Galera), RabbitMQ (active-active mirrored queues)
Eliminate SPoFs as best as you can.
15#What about my app?
Design for failure:
• If your application relies on Cloud infrastructure SLA for its HA needs, you are STUCK with that vendor / infrastructure
• Need to balance cost and complexity against risk tolerance
• Design the application so that you:
  • Build for server failure
  • Build for zone failure
  • Build for cloud failure
  • Keep management layer separate from infrastructure
16#Build for Server Failure
• Set up auto-scaling
• Set up database mirroring, master/slave configuration
• Use static public IPs
• Use Dynamic DNS for private IPs
17# Build for Zone Failure
[Diagram labels: Static Public IPs; DNS; Zone 1; Zone 2; Load Balancers; App Servers (autoscale); Master DB; Slave DB; Replicate; Block; Snapshots; Object store]
• Where possible, use NoSQL DB like Cassandra or MongoDB
• Snapshot data volume for backups so the database can be readily recovered within the region.
• Place Slave databases in one or more zones for failover.
• A creative deployment model would be to make your private cloud an “AZ” by placing it in close physical proximity to a public cloud provider
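A zone-failure design can be checked mechanically before an outage tests it for you. The following sketch flags zones that are missing a tier; the tier names, the `(tier, zone)` tuple format and the function name are hypothetical choices for this example, not an inventory format from any cloud or from RightScale.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Tiers that should be present in every zone (names are illustrative).
REQUIRED_TIERS = frozenset({"load_balancer", "app_server", "database"})


def missing_tiers_by_zone(servers: Iterable[Tuple[str, str]],
                          zones: Iterable[str]) -> Dict[str, Set[str]]:
    """Return, for each zone with a gap, the tiers it does not host.

    `servers` is an iterable of (tier, zone) pairs describing what runs where.
    """
    present = defaultdict(set)
    for tier, zone in servers:
        present[zone].add(tier)
    gaps = {}
    for zone in zones:
        missing = set(REQUIRED_TIERS - present[zone])
        if missing:
            gaps[zone] = missing
    return gaps
```

An empty result means every zone can serve traffic on its own; a slave database counts toward the `database` tier of the zone it runs in.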
18#Build for Cloud Failure (Cold DR)
Staged Server Configuration and generally no staged data. Cost: $
• Not recommended if rapid recovery is required
• Slow to replicate data to other cloud and bring database online
[Diagram labels: DNS; Private; DALLAS; Load Balancers; App Servers; Master DB; Slave DB; Replicate; Block; Snapshots; Cloud Files]
19#Build for Cloud Failure (Warm DR)
Staged Server Configuration, pre-staged data and running Slave Database Server. Cost: $$
• Generally recommended DR solution
• Minimal additional cost and allows fairly rapid recovery
[Diagram labels: DNS; Private; DALLAS; Load Balancers; App Servers; Master DB; Slave DB; Replicate; Block; Snapshots; Cloud Files]
20#Build for Cloud Failure (Hot DR)
Parallel Deployment with all servers running but all traffic going to primary. Cost: $$$
• Not recommended
• Very high additional cost to allow rapid recovery
[Diagram labels: DNS; Private; DALLAS; Load Balancers; App Servers; Master DB; Slave DB; Replicate; Block; Snapshots; Cloud Files]
21#Availability vs. Cost - Dial
[Diagram: Cost and Availability dials, each marked from Min to Max]
22#Make sure workload is portable across clouds
23#Automate and test everything
• Automate backups of your data
• Set up monitoring and alerts
• Run fire-drills! Plan and practice your recovery procedures!
24#Separate Management layer from Infrastructure
• Keep the keys to the car outside the car
25#Automating HA and DR
• Use dynamic DNS for your database servers
  • Allow app servers to use a single FQDN.
  • Use a low TTL to allow rapid failover in the case of a change in master database
• Automatic connection of app servers to load balancing servers
  • App servers can connect to all load balancers automatically at launch
  • No manual intervention
  • No DNS modifications
• Automated promotion of slave to master
  • Process is automated
  • Decision to run process is manual
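The split between an automated process and a manual decision can be sketched as a promotion routine that refuses to run without a named operator. Everything here (the `FailoverPlan` class, its fields and the in-memory DNS dictionary) is a hypothetical stand-in for real tooling, under the assumption that promotion ends by repointing the master FQDN.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FailoverPlan:
    dns_records: Dict[str, str]          # FQDN -> current address (stand-in for dynamic DNS)
    master_fqdn: str                     # the single name app servers connect to
    log: List[str] = field(default_factory=list)

    def promote(self, slave_address: str, confirmed_by: Optional[str]) -> None:
        """Run the automated promotion steps, but only on a human decision."""
        if not confirmed_by:
            # The decision stays manual: no operator name, no promotion.
            raise PermissionError("promotion requires an explicit operator decision")
        self.log.append(f"promote {slave_address} (confirmed by {confirmed_by})")
        # Automated part: repoint the master FQDN at the promoted slave.
        self.dns_records[self.master_fqdn] = slave_address
```

Keeping the trigger manual guards against promoting a slave when the master only appears to be down, while the automation keeps the steps fast and repeatable once the call is made.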
28#How RightScale makes it possible
RightScale ServerTemplates™
• Reproducible: Predictable deployment
• Dynamic: Configuration from scripts at boot time
• Multi-cloud: Cloud agnostic and portable
• Modular: Role and behavior abstracted from cloud infrastructure
29#How RightScale makes it possible
MultiCloud Images
• MultiCloud Images can be launched across regions and clouds without modification
1. ServerTemplate contains a list of MultiCloud Images (MCIs)
2. When the Server is created, a specific MCI is chosen.
3. The appropriate RightImage is used at launch.
[Diagram labels: MultiCloud Images; Cloud A, B, C; Image 1; Image 2; RightImage; Stability across clouds]
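The three steps on this slide amount to a two-level lookup: pick an MCI from the template's list, then pick that MCI's image for the target cloud. The sketch below models this with plain data structures; the class, field and function names are invented for illustration and do not reflect the actual RightScale API or data model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MultiCloudImage:
    name: str
    image_by_cloud: Dict[str, str]   # cloud name -> cloud-specific image id


def resolve_image(mcis: List[MultiCloudImage], mci_name: str, cloud: str) -> str:
    """Return the image id to launch for the chosen MCI on the given cloud."""
    for mci in mcis:
        if mci.name == mci_name:
            if cloud not in mci.image_by_cloud:
                raise LookupError(f"{mci_name} has no image for {cloud}")
            return mci.image_by_cloud[cloud]
    raise LookupError(f"no MultiCloud Image named {mci_name}")
```

Because the template refers to images only by MCI name, the same server definition can be launched on another cloud as long as that MCI lists an image for it.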
30#Outage-Proofing Best Practices
• Place in >1 zone:
  • Load balancers
  • App servers
  • Databases
• Maintain capacity to absorb zone or region failures
• Replicate data across zones
• Backup across regions & clouds
• Design stateless apps for resilience to reboot / relaunch
• Monitoring, alert, and automate operations to speed up failover
31#Thank you!
Sign up for a free account at: www.rightscale.com
Check out job postings at: www.rightscale.com/jobs
We are hiring!