Mmckeown hadr that_conf

Presentation deck from my talk at THAT CONFERENCE on 8/12/13 on Azure HA/DR.

Published in: Technology


  1. 1. PLATINUM SPONSORS Gold Sponsors
  2. 2. High Availability and Disaster Recovery in Windows Azure MIKE MCKEOWN BLOG: HTTP://WWW.MICHAELMCKEOWN.COM TWITTER: NWOEKCM LINKEDIN: WWW.LINKEDIN.COM/PUB/MIKE-MCKEOWN/20/B73/389/ CLOUD SOLUTIONS ARCHITECT - ADITI TECHNOLOGIES
  3. 3. CORPORATE OVERVIEW OF ADITI - TRUSTED, RESPECTED, TECHNOLOGY SERVICES LEADER 2012 Partner of the year Windows Azure , Finalist 2011 Partner of the year Windows Azure SI, Finalist 2010 Partner of the year Windows Azure , Winner  Best companies to work for  Top 10 IT Workplace  Global Cloud MVPs  Top 50 Cloud influencers  1:114 hiring ratio The Best ‘OF’ Vendor Award  52% of our customers rate us 5/5.  45 + active customers.  1200+ engagements.  1600 people, globally  18 years, 12 locations
  4. 4. You might be from Wisconsin if…  You have been both frostbitten and sunburned all in the same week.  You owe more money on your snowmobile than you do on your family car.  You consider a six pack of beer and a bug-zapper quality entertainment  You go to your family reunion looking to meet new women  You learned to drive a tractor before the training wheels were off your bike.  You think that John Deere Green, Ford Blue, and Primer Gray are the three primary colors.  Your school loses half its student body during deer season.  The blue book value of your truck goes up and down depending on how much gas it has in it.
  5. 5. Agenda
     • High Availability (HA) and Disaster Recovery (DR)
     • Definitions
     • Service Level Agreements (SLAs)
     • Designing for Failure
     • HA/DR Architectures
     • Failover Demo – Azure Traffic Manager
     • Tips and Best Practices
  6. 6. Introduction to HA/DR
     • High Availability (HA) includes a Disaster Recovery (DR) plan
     • Cloud failure is inevitable
     • Proper management means recognizing failures quickly to minimize their effects
     • Define tolerance thresholds and an associated strategy
     • Consider budget and strategic location of resources
     • Cloud provides affordable and easily configurable geo-redundancy
     • Azure builds resiliency into some of its services
     • For the others, you must build it in yourself
  7. 7. What is your Cloud HA/DR strategy?
  8. 8. 1. HA = Flat tire and spare donut tire
     With the spare tire, the car continues to run
     • Can’t reach top speeds
     • Can’t maneuver as well
     Example of Azure HA:
     • An instance of a Web role crashes due to a fault on its rack
     • SLA allows app to keep running
  9. 9. High Availability Definitions
     1. Fault Tolerance
        • Detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe
        • Use one or more design strategies: app redundancy, data replication, or degraded functionality (e.g. an order processing system)
     2. Availability
        • HA systems are measured by the percentage of time they are available to users, counting planned and unplanned service outages
        • Azure Availability SLA
        • Techniques can improve availability so the system stays available during problems
        • Redundant and reliable design
  10. 10. Redundancy in Windows Azure
      • Windows Azure Storage with 2x replicas
      • Azure SQL Database built-in 2x backup servers
      • Windows Azure Caching with high availability enabled
      • Multi-instance Windows Azure Web Sites and Cloud Services
      • Failover with Windows Azure Traffic Manager
  11. 11. Reliability in Windows Azure
      • Auto recovery of crashed/nonresponsive instances
      • Fault domain to scatter instances across racks
      • Virtual machine availability set to allocate VMs across fault domains
      • Upgrade domain to avoid shutting down all instances at the same time
      • Handle transient errors using the Transient Fault Handling Application Block
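
The last bullet on slide 11 refers to the Transient Fault Handling Application Block, which is a .NET library. The pattern behind it, retrying an operation that fails with a temporary error after a growing delay, can be sketched in Python as follows. This is an illustration of the retry pattern only, not the block's API: `TransientError`, `retry_with_backoff`, and their parameters are names assumed for the example.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying (e.g. a dropped connection)."""


def retry_with_backoff(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call operation(); on TransientError wait and retry with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2 * base_delay, ...
    The final failure is re-raised so the caller can fail over or degrade.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


calls = {"count": 0}


def flaky_query():
    # Fails twice, then succeeds, to imitate a short-lived outage.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("connection dropped")
    return "rows"


print(retry_with_backoff(flaky_query, base_delay=0.01))
```

Only errors known to be temporary are retried here; anything else propagates immediately, which keeps permanent faults from being hidden behind repeated attempts.
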
  12. 12. High Availability Definitions
      1. Fault Tolerance
         • Detects and maneuvers around failed elements to continue and return the correct results within a specific timeframe
         • Use one or more design strategies: app redundancy, data replication, or degraded functionality (e.g. an order processing system)
      2. Availability
         • HA systems are measured by the percentage of time they are available to users, counting planned and unplanned service outages
         • Azure Availability SLA
         • Techniques can improve availability so the system stays available during problems
         • Redundant and reliable design
      3. Scalability
         • Meet increased demand with consistent results in acceptable time windows
         • Horizontal scale out (dynamic) vs vertical scale up (restart)
  13. 13. Scalability
  14. 14. What does HA require?
      • Strategies to absorb outage of key components
      • No single points of failure
      • Multiple web servers and data replication
      • Graceful failover when individual components fail (and they will)
      • Backup components and systems
  15. 15. 2. DR = Bad Car Crash
      • Entire data center is down and there is no connection to the database
      • Network goes down and you can’t contact on-premises machines
  16. 16. Disaster Recovery
      • Process, policies, and procedures to restore critical systems after a catastrophic event
      • Examples: application failure, data corruption (including human error), network down, failure of a connected service, data center (DC) down
      • A DR plan is part of a good HA strategy
      • Invest time and resources to continually plan, prepare, rehearse, document, train, and update processes
      • One point of responsibility
      • Real World DR Plan – Dilbert Technical Services
      • Establish RPO and RTO and know your SLAs
  17. 17. Recovery Point Objective (RPO)
      [Diagram label: Disaster]
      • How much data can you lose and still be okay after rollback?
      • How consistent does data need to be after a rollback?
      • Larger RPO (>) means less critical and lower cost
      • Smaller RPO (<) means more critical and higher cost
  18. 18. Recovery Time Objective (RTO)
      [Diagram labels: Disaster, RTO]
      • How much time does it take to recover?
      • Larger RTO (>) means less critical and lower cost
      • Smaller RTO (<) means more critical and higher cost
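
Slides 17 and 18 define RPO as the tolerable window of lost data and RTO as the tolerable time to recover. One way to make those targets concrete is to check a candidate design against them. The sketch below is illustrative only: the class, function, and example figures are assumptions, and worst-case data loss is taken to be one full replication interval.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable window of lost data
    rto: timedelta  # maximum tolerable time to restore service


def meets_objectives(objectives, replication_interval, estimated_recovery_time):
    """Return (rpo_ok, rto_ok) for a candidate DR design.

    Worst-case data loss is assumed to be one full replication interval,
    i.e. the disaster strikes just before the next copy would have run.
    """
    rpo_ok = replication_interval <= objectives.rpo
    rto_ok = estimated_recovery_time <= objectives.rto
    return rpo_ok, rto_ok


# Invented figures for a hypothetical order system.
order_system = RecoveryObjectives(rpo=timedelta(minutes=1), rto=timedelta(minutes=30))

print(meets_objectives(
    order_system,
    replication_interval=timedelta(minutes=5),
    estimated_recovery_time=timedelta(minutes=20),
))
```

In this example a 5-minute replication interval misses a 1-minute RPO even though the 20-minute recovery estimate fits the RTO, so the two objectives have to be checked separately.
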
  19. 19. What’s in a Hot Dog?
      • Animal organs
        • Kidney, liver, hearts, etc.
        • Reproductive organs?
      • Plastic, glass, bugs, and animal bones
      • Mechanically Separated Meats
        • “A paste-like meat product produced by forcing bones, with attached edible meat, under high pressure through a sieve to separate the bone from the edible meat tissue.”
      • SLAs are like hot dogs
  20. 20. Service Level Agreements
      • The closer to a 10 (more 9’s), the more uptime, but at higher cost and with more maintenance
      • Azure has non-cumulative monthly SLAs
  21. 21. Compounding of SLAs
      Effective availability: considers the SLAs of each dependent service and their cumulative effect on the total system availability
      • Windows Azure Compute (2 instances) = 99.95%
      • SQL Azure Database = 99.9%
      • Windows Azure Storage = 99.9%
      • Total downtime allowed by the SLAs (hours per year)
        • 4.38 hours + 8.76 hours + 8.76 hours = 21.9 hours
      • Effective Availability: 99.75%
      • Is this good enough for your app?
      • Can the effective availability of the SLAs meet the RPO and RTO of your app?
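
The arithmetic on slide 21 can be checked with a short script. The sketch below assumes the downtime figures are hours per 8,760-hour year, which is what 99.95% and 99.9% work out to, and compares two ways of combining the SLAs: summing the downtime allowances, as the slide does, and multiplying the availabilities, which assumes the services fail independently. Function names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760

# SLA percentages quoted on the slide, expressed as fractions.
SLAS = {
    "Windows Azure Compute (2 instances)": 0.9995,
    "SQL Azure Database": 0.999,
    "Windows Azure Storage": 0.999,
}


def allowed_downtime_hours(sla, period_hours=HOURS_PER_YEAR):
    """Hours of downtime an SLA permits over the period."""
    return (1.0 - sla) * period_hours


def availability_by_summed_downtime(slas, period_hours=HOURS_PER_YEAR):
    """Add up each service's downtime allowance (outages assumed never to overlap)."""
    total = sum(allowed_downtime_hours(s, period_hours) for s in slas)
    return 1.0 - total / period_hours


def availability_by_product(slas):
    """Multiply the availabilities (failures assumed independent)."""
    return math.prod(slas)


total_hours = sum(allowed_downtime_hours(s) for s in SLAS.values())
print(f"Allowed downtime per year: {total_hours:.2f} hours")
print(f"Summed downtime: {availability_by_summed_downtime(SLAS.values()):.2%}")
print(f"Product:         {availability_by_product(SLAS.values()):.2%}")
```

Both methods land at roughly 99.75% for these three services; they diverge more as the number of dependencies grows or the individual SLAs get weaker.
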
  22. 22. SLA – Downtime vs Costs
  23. 23. Azure HA/DR Architecture Concepts
      • Failure Design
      • Multi-Site Data Backup/Recovery Strategies
      • Immediately or eventually consistent systems
      • Fabric Controller (FC) and Fault Domains
      • PaaS and IaaS
      • Windows Azure Traffic Manager
  24. 24. Design For Failure
      • Large-scale failures in any cloud are rare but will happen
      • Cloud data centers don’t magically remove failures
      • Fabric Controller helps to quickly recover from problems in one DC
      • Understand RPO/RTO requirements to design for failures
      • Balance cost and complexity of HA/DR efforts against the risk(s) you’re willing to bear
      • Cloud has made DR and HA remarkably easy and affordable
        • Multiple configurations possible with a few clicks
      • Application owners are ultimately responsible for failure management
        • Owners of DR plans and HA strategy
  25. 25. Multi-Site Data Recovery Approaches
      1. Azure Data Sync Services (PaaS)
         • Recommended between Azure SQL Database instances only
         • 5 minutes minimum replication
         • If you need a lower RPO, you must build it yourself
         • Creates clutter in synced databases
      2. SQL Server Merge Replication (IaaS)
         • Two SQL Server databases (IaaS VMs) in two different regions
         • An update in DB A goes to DB B also, and vice versa
         • Synchronous transactional operations lock tables and affect performance
      3. SQL Server 2012 Always-On Availability Groups (IaaS)
         • Two SQL Server databases (IaaS VMs) in different regions
         • Immediate replication between the master and its replicas
         • Non-transactional, so no locking or performance degradation
  26. 26. 1. Azure Data Sync Services
      • SQL Azure Database only (pure PaaS)
      • 5 minute minimum replication
      • Transactional and blocking
      • One way or two way
      • Not recommended with SQL Server
      [Diagram labels: Azure SQL Database, Data Sync Services, Azure SQL Database]
  27. 27. 2. SQL Server Merge Replication/Azure IaaS VMs
      • Two databases in two different regions in IaaS VMs
      • An update in DB A goes to DB B, and vice versa
      • Synchronous transactional operations lock tables and affect performance
      [Diagram labels: Azure IaaS VM and SQL Server 1 (SQL Server Database A), Azure IaaS VM and SQL Server 2 (SQL Server Database B), Trans Sync from A to B, Trans Sync from B to A]
  28. 28. 3. SQL Server 2012 Always-On Availability Groups
      • Two databases in two different regions in IaaS VMs
      • Immediate replication between the master and its replicas
      • Non-transactional, so no locking or performance degradation
      [Diagram labels: two Azure IaaS VMs running SQL Server 2012, Master DB, Replica DB, Always On (Non-Blocking) Synchronization]
  29. 29. Consistency Models
      • Immediately consistent systems
        • Traditional synchronous pattern of all at once
        • Can hurt performance with locking/blocking
        • Possibly lose something at failure and recovery
        • The “C” in ACID
          • Transactional consistency to all affected data based upon rules, triggers, constraints
      • Eventually consistent systems
        • Asynchronous patterns using durable queues
        • Nothing lost in recovery
          • The ability to recreate system after failure
        • Improves fault tolerance in systems
        • Customer may not need to see immediate updates
          • Posts to Twitter/Facebook
        • DB may have some inconsistencies at any point in time
        • All nodes eventually consistent when all updates are done
      • Both have a role in HA/DR based upon RTO and RPO
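
The eventually consistent pattern on slide 29, where writes are committed locally and forwarded through a durable queue, can be sketched as follows. This is a toy model under stated assumptions: an in-memory `deque` stands in for a durable queue, and `Primary`, `Replica`, and `drain` are names invented for the example.

```python
from collections import deque


class Replica:
    """A node that applies updates in the order it receives them."""

    def __init__(self):
        self.data = {}

    def apply(self, update):
        key, value = update
        self.data[key] = value


class Primary(Replica):
    """Commits writes locally, then queues them for later delivery."""

    def __init__(self, queue):
        super().__init__()
        self.queue = queue

    def write(self, key, value):
        self.apply((key, value))
        self.queue.append((key, value))


def drain(queue, replica):
    """Deliver queued updates in order; return how many were applied."""
    applied = 0
    while queue:
        replica.apply(queue.popleft())
        applied += 1
    return applied


pending = deque()  # stand-in for a durable queue
primary = Primary(pending)
secondary = Replica()

primary.write("order-17", "paid")
primary.write("order-17", "shipped")
print(secondary.data)                  # the replica has not seen the updates yet
drain(pending, secondary)
print(secondary.data == primary.data)  # both nodes agree once the queue is empty
```

Between the write and the drain the two nodes disagree, which is the "some inconsistencies at any point in time" trade-off the slide describes; because the queued updates survive, the replica can always catch up.
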
  30. 30. Fault Domains – PaaS
      “A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”
      • Can’t control FDs – given by Azure
      • Fault Domains do not span data centers
      • FC provisions multiple role instances across Fault Domains
      • FC monitors Fault Domains to reduce localized failures
      • Upon failure, FC enforces SLA and re-provisions instances
  31. 31. Fault Domains – IaaS Virtual Machines
      “A fault domain is a set of hardware components – computers, switches, and more – that share a single point of failure.”
      • VM Availability Sets
      • Different Fault Domains/Racks
      • Azure locates VMs in different fault domains to prevent localized failure
      • Required for 99.95% VM SLA
      • Ex. Web & SQL Server
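
Slides 30 and 31 explain why at least two instances are needed: instances are spread across fault domains so that losing one rack does not take everything down. The sketch below illustrates that reasoning with a simple round-robin placement. It is not how the Fabric Controller actually places instances; the function names and the two-domain default are assumptions made for the example.

```python
def assign_fault_domains(instance_names, fault_domain_count=2):
    """Round-robin instances across fault domains (illustrative placement only)."""
    return {name: i % fault_domain_count for i, name in enumerate(instance_names)}


def survivors(placement, failed_domain):
    """Instances still running when one fault domain is lost."""
    return [name for name, fd in placement.items() if fd != failed_domain]


def tolerates_single_domain_failure(placement):
    """True if at least one instance survives the loss of any one fault domain."""
    domains = set(placement.values())
    return bool(placement) and all(survivors(placement, fd) for fd in domains)


single = assign_fault_domains(["web_0"])
pair = assign_fault_domains(["web_0", "web_1"])

print(tolerates_single_domain_failure(single))  # one instance: a rack fault stops the app
print(tolerates_single_domain_failure(pair))    # two instances land in different domains
```
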
  32. 32. Windows Azure Traffic Manager (WATM)
      • Automated priority of routing
        1. Failover
        2. Performance
        3. Round-robin
      • Gives a new DNS prefix for users
      • Key point: you decide whether your failover domain is dormant or active while NOT in failover mode
      • WATM rolls over regardless of whether the site is up or down
      • You need to manage whether the failover domain is active or dormant
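
The failover routing method on slide 32 amounts to walking an ordered list of endpoints and sending traffic to the first one that passes its health probe. The sketch below shows that policy in isolation. It is not the Traffic Manager API: the hostnames, the `pick_endpoint` function, and the dictionary used as a health probe are all hypothetical.

```python
def pick_endpoint(endpoints_by_priority, is_healthy):
    """Failover policy: return the first healthy endpoint in priority order.

    Returns None when every endpoint fails its health probe.
    """
    for endpoint in endpoints_by_priority:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical deployments: primary first, failover second.
ENDPOINTS = ["myapp-east.example.net", "myapp-west.example.net"]

# Stand-in for a real health probe (e.g. an HTTP GET against a monitoring path).
probe_results = {"myapp-east.example.net": True, "myapp-west.example.net": True}

print(pick_endpoint(ENDPOINTS, probe_results.get))  # primary while it is healthy
probe_results["myapp-east.example.net"] = False     # simulate a primary outage
print(pick_endpoint(ENDPOINTS, probe_results.get))  # traffic moves to the failover site
```

The policy only chooses where requests go. Whether the second deployment is actually running and has current data when it is chosen remains the application owner's job, which is the key point on the slide.
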
  33. 33. HA/DR Cloud Architectures
  34. 34. HA/DR Types and Terms
      • Mostly PaaS concepts with a bit of IaaS
      • Example: home phone
      1. Cold
         • Backup has nothing active, pre-loaded, or updated
         • Least expensive and slowest recovery
         • Ex. Have to go out and buy a new home phone
      2. Warm or Passive
         • Backup has some parts loaded/current and others made active upon failure
         • Ex. Home phone at house but still packed and not charged
      3. Hot or Active
         • Backup is loaded and ready to receive load upon failure but not active
         • Ex. Home phone with charged battery but not plugged into home circuit
      4. Highly Available
         • Backup is loaded and active and receiving load as part of normal processing
         • Most expensive and quickest recovery
         • Ex. Home phone with charged battery and plugged into home phone circuit
  35. 35. RTO vs. Cost
  36. 36. Single Region Deployment
  37. 37. Cold DR
  38. 38. Warm DR [Diagram labels: Fault Domain #1 and Fault Domain #2, repeated three times]
  39. 39. Hot DR – Option 1
  40. 40. Hot DR – Option 2 [Diagram labels: Fault Domain #1 and Fault Domain #2, repeated three times]
  41. 41. High Availability [Diagram labels: Fault Domain #1 and Fault Domain #2, repeated twice]
  42. 42. Demo: HA using Azure Traffic Manager
  43. 43. HA/DR Checklist for Risk Mitigation
      1. Conduct a risk assessment for each application
         • Each can have different requirements
         • Some applications are more critical than others
         • Justify the extra cost to architect them for disaster recovery
         • Use this information to define the RTO and RPO for each application
      2. Design for failure, starting with the application architecture
      3. Implement best practices for high availability
         • Balancing cost, complexity, and risk
      4. Implement disaster recovery plans and processes
      5. Establish backup strategies for all reference and transactional data
      6. Consider failures that span the module level all the way to a complete cloud outage
      7. Choose a multi-site disaster recovery architecture
  44. 44. General HA Best Practices
      • Avoid single points of failure
      • Always place (at least) one of each component (load balancers, app servers, databases, …) in at least two regions or fault domains
      • Maintain sufficient capacity to absorb region/fault domain failures
      • Reserved Instances (hot) – guarantee capacity is available in a separate region/cloud
      • Replicate data across clouds/regions for failover
      • Set up monitoring, alerts, and operations to identify problems and automate their resolution or the failover process
      • Design stateless applications for resilience to reboot/relaunch
  45. 45. Summary
      • Plan and design for failure
      • Work with business and IT on RPO and RTO
      • Understand cumulative SLAs
      • Implement correct HA/DR architectures
      • Best Practices and Checklist
      • Start with some DR strategy and improve continually
  46. 46. Resources
      • Disaster Recovery and High Availability for Windows Azure Applications
        Mike McKeown and Hanu Kommalapati
        http://msdn.microsoft.com/en-us/library/dn251004.aspx
      • Contingency Planning Guide for Information Technology Systems
        National Institute of Standards and Technology
        https://www.fismacenter.com/sp800-34.pdf
      • Failsafe: Guidance for Resilient Cloud Architectures
        Marc Mercuri, Ulrich Homann, and Andrew Townhill
        http://msdn.microsoft.com/en-us/library/windowsazure/jj853352.aspx
      • Business Continuity for Windows Azure
        Patrick Wickline, Adam Skewgar, Walter Myers III
        http://msdn.microsoft.com/en-us/library/windowsazure/hh873027.aspx
  47. 47. Questions?
  48. 48. AUGUST 11TH – 13TH 2014 SAME PLACE, SAME TIME
