Operating a Highly Available Cloud Service
Operating a highly available cloud service is not just about technology and architecture. It has a lot to do with people and processes. Everything fails all the time. So how do you ensure you have the right people and the right processes in the right places to run a highly available web service? This talk covers the people, processes, technology, and tools required to run a highly available web service.



Presentation Transcript

  • Operating a Highly Available Cloud Service November 14, 2013 Depankar Neogi Chief Architect QuickBase, Intuit Inc. Presented at Boston Cloud Services Meetup http://www.meetup.com/Boston-cloud-services/events/141118632/
  • Agenda
    • Intuit and QuickBase
    • Building and Running Highly Available Cloud Services
      – People & Process
      – Technology
    The single most important thing to keep in mind when designing for High Availability is to anticipate failure.
  • Improving 60M Lives • #1 Financial Management Software • Facilitate $40B Tax Refunds • #1 for Innovation in Computer Software Industry • 20% of GDP & Pay 1 in 12 • Apps for >50% of Fortune 500
  • What is QuickBase? An Enterprise platform to empower your team to build applications
    • Easily customized to meet unique business needs
    • Excel to QuickBase in less than 5 minutes
    • Brand NEW modern UI enables Ease of Use
    • Requirements, processes and teams evolving constantly
    • More than 4,500 companies use QuickBase
    • 500,000+ current users
    • One platform solves jobs across the enterprise: Project Management, IT helpdesk, CRM, Field service, Human resources, etc.
  • QuickBase – Customized applications matching your unique requirements: Roles Based UI • Dashboards & Reports • Data Storage & Backup • Secure Access Control • Relational Data Tables • Business logic & workflow • Open extensible APIs • Common Infrastructure Services
  • Modern, Easy, Productive, Dynamic, Fast: Cloud based Database with a beautiful UX
    • 30 million requests per day
    • 80K unique visitors per day
    • 100,000 active apps at any time
    • 25 milliseconds median processing time
    • Supports Dynamic DML, DDL, CRUD
  • New QuickBase DIY Data Access
    1. QuickBase UI Extended with new DIY data sharing
    2. New Data Sharing Service (ANY API)
    3. Connections to Popular Industry Data
    Components shown: Liberators, Data Mapping, WSQL Transforms, Virtual tables, Liberator Cache, Library, Warehouse, Scheduler, Repository
    Intuit-class infrastructure (security, billing, HADR, hosting)
  • PSTN Systems: Availability SLA and Downtime
    • 99.9999% (“six nines”): 31.5 secs/yr, 2.59 secs/month, 0.605 secs/week
    • 99.999% (“five nines”): 5.26 mins/yr, 25.9 secs/month, 6.05 secs/week
  • Web Services: Availability SLA and Downtime
    • 99.95%: 4.38 hrs/yr, 21.56 mins/month, 5.04 mins/week
    • 99.9%: 8.76 hrs/yr, 43.8 mins/month, 10.1 mins/week
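The downtime figures on the two SLA slides follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year; the function and constant names are illustrative and not from the talk:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60
SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, period_seconds: int) -> float:
    """Return the downtime budget, in seconds, for one period at the given SLA."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    # The unavailable fraction of the period is the downtime budget.
    return (100 - availability_percent) / 100 * period_seconds


if __name__ == "__main__":
    for sla in (99.9999, 99.999, 99.95, 99.9):
        per_year = allowed_downtime_seconds(sla, SECONDS_PER_YEAR)
        per_week = allowed_downtime_seconds(sla, SECONDS_PER_WEEK)
        print(f"{sla}%: {per_year / 60:.2f} min/yr, {per_week:.2f} s/week")
```

Each extra "nine" divides the budget by ten, which is why the gap between a web-service SLA and a PSTN-class SLA is hours versus seconds per year.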
  • http://www.google.com/apps/intl/en/terms/sla.html
  • Operating High Availability Service: PEOPLE & PROCESSES
  • People & Process: Monitoring Business Metrics
    • It’s critical to detect a problem before your customers have to tell you or you have to ask them.
    • By monitoring real-time business metrics and comparing the actual data to a historical curve, you can detect a problem more quickly and avoid sifting through the alerting and monitoring white noise that your systems will inevitably produce.
    • Five evolutionary questions that monitoring should answer:
      1. Is there a problem?
      2. Where is the problem?
      3. What is the problem?
      4. Why is there a problem?
      5. Will there be a problem?
    • External versus Internal Monitoring
    http://akfpartners.com/techblog/2009/06/15/monitoring-strategies/
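Comparing a live business metric against its historical curve can be sketched as follows. This is only one possible approach, assuming the history holds the same metric at the same time slot in earlier weeks and that a three-standard-deviation band is an acceptable threshold; neither assumption comes from the talk, and `is_anomalous` is a hypothetical name.

```python
from statistics import mean, stdev


def is_anomalous(actual: float, history: list[float], tolerance: float = 3.0) -> bool:
    """Answer "is there a problem?" for one metric sample.

    history holds the metric for the same time slot in earlier periods
    (for example, logins between 09:00 and 09:05 on previous Mondays).
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any deviation at all is unusual.
        return actual != baseline
    return abs(actual - baseline) > tolerance * spread


if __name__ == "__main__":
    previous_weeks = [100, 104, 98, 102, 96]
    print(is_anomalous(101, previous_weeks))  # within the usual band
    print(is_anomalous(60, previous_weeks))   # far below the usual band
```

A check like this only answers the first of the five questions; locating and explaining the problem still needs the system-level monitoring.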
  • People & Process: Invest in Good Tools. A good tool will help you find the needle in a haystack, fast.
    • 95K requests in a 12-hour window
    • Peak request rate: 4.3 req/sec (1,286 requests per 5-minute window)
    • Processing time: 61 milliseconds per request
  • People & Process: Incident Management Process
    • Incident Management Team (IMT)
    • Incident Management Response Plan
    • Activating the IMT, notifications
    • Having the right break-out rooms
    • Classification of the incident
    • Communication of the incident
    • Time keeper
    • Management versus Technical Process
    • Tracking:
      – SLA
      – RPO (recovery point objective)
      – RTO (recovery time objective)
    • Incident closure, recovery
    • Evaluation process
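Tracking RPO and RTO during an incident comes down to comparing two measured durations against their objectives. A minimal sketch, assuming the incident record carries three timestamps; the `Incident` fields and function names are hypothetical, not part of any process described in the talk.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    last_good_copy_at: datetime    # last moment data was safely replicated or backed up
    outage_started_at: datetime
    service_restored_at: datetime


def data_loss_window(incident: Incident) -> timedelta:
    """Measured against the RPO: how much recent data may be lost."""
    return incident.outage_started_at - incident.last_good_copy_at


def recovery_duration(incident: Incident) -> timedelta:
    """Measured against the RTO: how long the service was down."""
    return incident.service_restored_at - incident.outage_started_at


def met_objectives(incident: Incident, rpo: timedelta, rto: timedelta) -> bool:
    return data_loss_window(incident) <= rpo and recovery_duration(incident) <= rto
```

Recording these values for every incident and every failover drill gives the evaluation step concrete numbers to review.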
  • People & Process: Runbook and messaging
    • Runbook
      – Detailed process for managing the incident
      – Contact information
      – Managing data center cutover, recovery steps, testing, managing replication
    • Messaging book
      – Who is responsible for communication
      – Who creates and approves the message
      – How you communicate
      – At what cadence
      – What you tell your customers
    • Social Media Strategy
      – If you are not transparent, your customers will let you know
      – Social media coordinator owns the channels
  • People & Process: Service Page. Give customers the ability to check the health of the system and to be notified of any service-related issues.
  • People & Process: Service Page. Transparency is key. If you let the customers know what you know, they will respect you and may remain loyal to your business.
  • People & Process: Business Fault Isolation
    • What if your data center went down
    • And the production server is down because the data center is down
    • And your email server was in the same data center
    • And your marketing server was in the same data center
    • And your service page was on a server in the same data center
    • How do you communicate with all your customers?
    Business Fault Isolation protects your business from a SPOF (single point of failure).
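The scenario above can be checked mechanically if you keep an inventory of where each system is hosted. A minimal sketch with made-up site and system names, all hypothetical:

```python
def collateral_outage(failed_site: str, site_by_system: dict[str, str]) -> list[str]:
    """Return every system that goes down together with failed_site."""
    return sorted(name for name, site in site_by_system.items() if site == failed_site)


def can_still_reach_customers(
    failed_site: str, site_by_system: dict[str, str], channels: list[str]
) -> bool:
    """True if at least one communication channel survives the site failure."""
    return any(site_by_system[channel] != failed_site for channel in channels)


if __name__ == "__main__":
    channels = ["email", "marketing", "service_page"]
    shared = {
        "production": "dc-east",
        "email": "dc-east",
        "marketing": "dc-east",
        "service_page": "dc-east",
    }
    isolated = dict(shared, service_page="external-host")
    print(can_still_reach_customers("dc-east", shared, channels))    # no channel left
    print(can_still_reach_customers("dc-east", isolated, channels))  # service page survives
```

Running a check like this against the real inventory during an operations review is one way to catch a communication channel that shares a failure domain with production.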
  • People & Process: Review Process
    • SaaS or Operations Review Process should have a fixed cadence and be led by a company leader
    • Review Team should include leaders from:
      – Finance
      – Compliance & Risk
      – CTO
      – Operations
      – Product
    • Dashboard with KPI
    • Review Fire drills
    • Change Control Process
      – Preferably change one thing at a time
  • Operating High Availability Service: TECHNOLOGIES
  • The Three Pillars of High Availability The goal of High Availability and Disaster Recovery (HA/DR) is to provide Business Continuance through: Lack of Service Outage = Happy Customers = Greater Business Value HA/DR directly enhances a customer’s experience through greater offering availability
  • High Availability Architecture Principles • Design for Failure – Avoid Single Points of Failure – Graceful Degradation and Soft Dependencies – Asynchronous Design – Keep State Confined to Where it is Needed • Design for Operability – Design to be Monitored – Design for Hot Deployment and Rollback – Automate Where Possible • Keep Everything “In Production” • Scale Out (Not Up) • Keep it Fresh…and Mature
  • Architecture Patterns for High Availability: Swimlanes • Active/Active • Single Write Master • Active/Passive • Store and Forward
  • Active / Passive: Primary Data Center (Active Data), Near Real-time Replication, Secondary Data Center (Passive Back Up)
  • Swimlane Principle
    A “Swimlane” is: a set of predefined systems and software infrastructure tuned to support a predefined workload
    • Only a portion of an offering’s total users are hosted on any given swimlane
    Within a Swimlane:
      – Each Swimlane is independent and self-sufficient and shares no compute/storage resources with other swimlanes
      – Offering transactions occur within a Swimlane
      – Only access to Shared Services goes outside the Swimlane
      – Standard Fault Detection and Fault Recovery methods are used
  • High Availability with Swimlanes (diagram): Application Partitioning and GTM via Swimlanes. Labels shown: Internet, DNS, F5 GTM, F5 LTM, DC 1, DC 2, Fault Domain 1, Fault Domain 2, Swimlanes 1 to 4 and 1’ to 4’, each with WS (web server), AS (app server) and Storage.
  • Swimlanes Support Application Needs
    • Scalability: replicated swimlanes add capacity with linear scalability
    • Fault Isolation: complete failure only impacts a subset of users due to application partitioning and data sharding
    • High Availability: individual tiers can be made highly available through intra-VM application recovery, intra-swimlane application failover or intra-swimlane VM restart
    • Disaster Recovery: disaster recovery is achieved through swimlane failover, either in the same or a remote data center
    • Automation: the identical nature of a swimlane allows for a high degree of operational automation
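The fault-isolation property of swimlanes depends on pinning each customer to exactly one lane. The talk does not say how customers are assigned; the sketch below assumes a stable hash of a customer identifier, and `swimlane_for` and `affected_customers` are hypothetical names.

```python
import hashlib


def swimlane_for(customer_id: str, swimlane_count: int) -> int:
    """Map a customer to one swimlane, identically on every host and every run."""
    if swimlane_count < 1:
        raise ValueError("need at least one swimlane")
    # hashlib is used instead of hash() because hash() is salted per process.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % swimlane_count


def affected_customers(customers: list[str], failed_lane: int, swimlane_count: int) -> list[str]:
    """Only customers pinned to the failed swimlane are impacted."""
    return [c for c in customers if swimlane_for(c, swimlane_count) == failed_lane]


if __name__ == "__main__":
    customers = [f"customer-{n}" for n in range(1000)]
    down = affected_customers(customers, failed_lane=2, swimlane_count=4)
    print(f"{len(down)} of {len(customers)} customers affected")
```

Modulo hashing reshuffles customers whenever the lane count changes, so a production system would more likely keep an explicit customer-to-swimlane directory; the hash is used here only to keep the example short.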
  • Active / Active – Swim Lanes (diagram): a Global Load Balancer in front of Data Center 1 and Data Center 2, with four swimlanes each serving 25% of customers. Each database is active in one place and passive in another, kept in sync by Replication: DB1 active with DB3 passive, DB3 active with DB1 passive, DB2 active with DB4 passive, DB4 active with DB2 passive.
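Failover in an active/active swimlane layout can be sketched as a lookup from each database to its active and passive locations. The placement below (DB1 and DB2 active in DC1, DB3 and DB4 active in DC2) is an assumption made for illustration; the slide text does not say which data center hosts which copy.

```python
# Assumed placement: database -> (site of active copy, site of passive copy).
PLACEMENT = {
    "DB1": ("DC1", "DC2"),
    "DB2": ("DC1", "DC2"),
    "DB3": ("DC2", "DC1"),
    "DB4": ("DC2", "DC1"),
}


def serving_site(db, failed_sites=frozenset()):
    """Return the site that should serve db, or None if both copies are down."""
    active, passive = PLACEMENT[db]
    if active not in failed_sites:
        return active
    if passive not in failed_sites:
        return passive  # promote the passive copy
    return None


def load_share(failed_sites=frozenset()):
    """Percentage of customers served per site, at 25% of customers per database."""
    share = {}
    for db in PLACEMENT:
        site = serving_site(db, failed_sites)
        if site is not None:
            share[site] = share.get(site, 0) + 25
    return share


if __name__ == "__main__":
    print(load_share())                # both sites healthy
    print(load_share({"DC1"}))         # DC1 lost, DC2 carries everything
```

The second call shows the capacity consequence of this pattern: each data center has to be sized to carry the full load while the other is down.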
  • Active / Active – Single Write Master (diagram): DC1, DC2, DC3 and DC4 each serve reads from a local Read Cache. Writes and Updates go to a single write master, which sends Cache Updates to the read caches.
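The single write master layout can be sketched in a few lines: every write is forwarded to one master, and every site answers reads from its own cache. This is a toy in-memory model under stated assumptions; the class and method names are hypothetical, and cache updates are applied synchronously here only to keep the example short.

```python
class SingleWriteMaster:
    """Toy model: one authoritative store, one read cache per data center."""

    def __init__(self, sites, master_site):
        if master_site not in sites:
            raise ValueError("master_site must be one of sites")
        self.master_site = master_site
        self.master_store = {}
        self.read_caches = {site: {} for site in sites}

    def write(self, origin_site, key, value):
        """Any site may receive the request, but only the master applies it."""
        if origin_site not in self.read_caches:
            raise KeyError(origin_site)
        self.master_store[key] = value
        # Push the cache update to every site (asynchronous in a real system).
        for cache in self.read_caches.values():
            cache[key] = value

    def read(self, site, key):
        """Reads never leave the local data center."""
        return self.read_caches[site].get(key)


if __name__ == "__main__":
    cluster = SingleWriteMaster(["DC1", "DC2", "DC3", "DC4"], master_site="DC1")
    cluster.write("DC3", "plan", "enterprise")
    print(cluster.read("DC4", "plan"))
```

In a real deployment the cache updates would probably be asynchronous, so a read shortly after a write may return stale data, and the master itself remains a component that needs its own failover plan.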
  • Design for Failure: Resiliency Patterns. Throttling versus Circuit Breaker
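Throttling limits how much load a service accepts, whereas a circuit breaker stops a caller from hammering a dependency that is already failing. The talk does not name a throttling algorithm; a token bucket is one common choice, sketched here with an injectable clock so it can be tested without sleeping. All names are illustrative.

```python
import time


class TokenBucket:
    """Admit at most rate_per_sec requests on average, with bursts up to capacity."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.updated_at = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.updated_at = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A rejected request would typically receive a fast "try again later" response, which degrades gracefully instead of letting queues build up.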
  • Circuit Breaker Pattern: Circuit Breaker State Diagram. A circuit breaker sits between the caller and the dependency.
    • Closed: on call / pass through; call succeeds / reset count; call fails / count failure; threshold reached / trip breaker
    • Open: on call / fail; on timeout / attempt reset
    • Half Open: on call / pass through; on succeed / reset; on fail / trip breaker
    http://techblog.netflix.com/2012_02_01_archive.html
  • Circuit Breaker Pattern: Example http://techblog.netflix.com/2012_02_01_archive.html
  • Circuit Breaker Pattern: Example of how threads, network timeouts and retries combine http://techblog.netflix.com/2012_02_01_archive.html
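The closed, open and half-open states of the circuit breaker can be sketched as a small class. This is a minimal single-threaded illustration, not the Netflix implementation referenced on the slide; the class name, parameters and defaults are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # on timeout: attempt reset
            else:
                raise CircuitOpenError("circuit open; failing fast")  # on call: fail
        try:
            result = func(*args, **kwargs)  # closed or half open: pass through
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failures = 0       # call succeeds: reset count
        self.state = "closed"   # a half-open success closes the breaker

    def _on_failure(self):
        if self.state == "half_open":
            self._trip()        # trial call failed: trip again
            return
        self.failures += 1      # call fails: count failure
        if self.failures >= self.failure_threshold:
            self._trip()        # threshold reached: trip breaker

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
```

Usage would look like `breaker.call(fetch_profile, user_id)`, where `fetch_profile` is whatever function talks to the dependency; callers catch `CircuitOpenError` and return a fallback. A production version would also need locking for concurrent callers.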
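How threads, network timeouts and retries combine can be shown with simple arithmetic: retries multiply the time a request can hold a thread, and the arrival rate multiplies the number of threads held. A rough sketch with illustrative numbers; the function names and the worst-case model (every attempt runs to its full timeout) are assumptions.

```python
import math


def worst_case_seconds(connect_timeout, read_timeout, retries, backoff=0.0):
    """Longest time one request can hold a thread when every attempt times out."""
    attempts = 1 + retries
    return attempts * (connect_timeout + read_timeout) + retries * backoff


def threads_held(requests_per_second, seconds_per_request):
    """Threads occupied in steady state: arrival rate times time in the system."""
    return math.ceil(requests_per_second * seconds_per_request)


if __name__ == "__main__":
    slow = worst_case_seconds(connect_timeout=1.0, read_timeout=2.0, retries=2)
    print(slow)                    # seconds one stuck request can occupy a thread
    print(threads_held(10, slow))  # threads tied up at 10 requests per second
```

With these numbers a dependency that stops answering ties up far more threads than the same traffic needs when it is healthy, which is the situation a circuit breaker is meant to cut short.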
  • Examples of Tools for Building HA Systems
    • Highly Available DNS – Akamai, Dyn, AWS Route53
    • Load Balancing – F5 LTM, F5 GTM, AWS ELB
    • Data Replication – GoldenGate
    • Monitoring – eHealth, Spectrum, Wily, Splunk, Cacti
    • Application Performance – DynaTrace, NewRelic
    • Deployment – Perforce, Maven, Nexus, Hudson, Puppet
    • Distributed Databases – NuoDB, VoltDB, several NoSQL types
    • Distributed Storage – GlusterFS, Atmos, OpenStack
    • HA Devices – Veritas Cluster Server
    • OS Virtualization – AWS, VMware, Xen, Parallels
    • Network Virtualization – AWS, VMware NSX, PLUMgrid
    • Caching – Memcached, Akamai, CloudFront
    • Failure Testing – Netflix Chaos Monkey
    • DDoS Protection – Arbor, Riverbed
  • Trust Not the Execution Environment: “Everything Fails, All the Time.” – Werner Vogels, CTO of Amazon.com
  • Summary: Operating HA Service
    • Monitoring Business Metrics
    • Incident Management Process
    • Runbooks
    • Social Media & Messaging
    • Service Page
    • Business Fault Isolation
    • SLA, RPO, RTO
    • Failover Drills
    • Review Process
    • Change one thing at a time
    Principles:
      – Design for Failure
      – Design for Operability
      – Keep Everything “In Production”
      – Scale Out (stateless)
      – Keep it Fresh
    Patterns:
      – Active/Active
      – Swimlanes
      – Active/Passive
      – Store-Forward
    Design:
      – Throttling
      – Circuit Breaker
      – Caching
      – Rollback
      – Healthchecks
    Tools
  • Thank You!