Operations @ Scale
Anurag Gupta, VP
AWS Database Services
Dev/Ops
How I learned to stop worrying and love my pager
Dev/Ops - your dev org is your ops org
I get a pager! You get a pager! Everyone gets a pager!
Why would I possibly want this?
It motivates design for operability
It aligns your interests w/ your customer experience
It improves the feedback loop to customer needs
Monitor everything
Every API call to your service,
Every API call you make to a dependent service
Canary traffic for things that vary (e.g., SQL statements)
Most of the metrics won’t be meaningful. That’s OK
Page on your high signal-to-noise metrics
Monitor these metrics during deployments
Medians/averages, fleet-wide rollups, and coarse time grains obscure problems
Measure TP90, TP99 (99th percentile response time)
Measure at finer and finer grain
Evaluate per-customer metrics
Look for the needles in the haystack
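A minimal sketch of why per-customer tail metrics matter, assuming latency samples arrive as (customer, milliseconds) pairs. The function names, the nearest-rank percentile method, and the sample data are illustrative assumptions, not the tooling described in the talk.

```python
from collections import defaultdict

def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def per_customer_tp(samples, pct):
    """Group (customer_id, latency_ms) pairs and take the percentile per customer."""
    by_customer = defaultdict(list)
    for customer, latency in samples:
        by_customer[customer].append(latency)
    return {c: percentile(v, pct) for c, v in by_customer.items()}

# Hypothetical data: customer "a" is healthy, customer "b" has one very slow call.
samples = [("a", 10)] * 95 + [("b", 10)] * 4 + [("b", 900)]
latencies = [latency for _, latency in samples]

fleet_mean = sum(latencies) / len(latencies)
fleet_tp99 = percentile(latencies, 99)
tp99_by_customer = per_customer_tp(samples, 99)
```

With this data the fleet-wide mean is 18.9 ms and the fleet-wide TP99 is 10 ms, so both look healthy. Only the per-customer view shows that customer "b" has a TP99 of 900 ms.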
Correction-of-Error (COE) Reporting
Meet weekly on operations (execs, service operators)
Review each issue that happened.
“Spin the wheel” to review a service’s metrics
Support a “truth-seeking” culture
Looking for data, process improvements
COE
- Customer impact
- Timeline: incident to detection to response to resolution
- 5 Whys? Get to actionable changes to extinguish cause
- Actions
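One way to make the COE timeline measurable is to record the four timestamps and derive the gaps between them. The sketch below is an assumption about how such a record might look. The class name, field names, and the example incident are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class CorrectionOfError:
    """Hypothetical COE record mirroring the four sections on the slide."""
    customer_impact: str
    occurred_at: datetime
    detected_at: datetime
    responded_at: datetime
    resolved_at: datetime
    five_whys: list = field(default_factory=list)
    actions: list = field(default_factory=list)

    def timeline(self):
        """Return the duration of each phase between the recorded timestamps."""
        return {
            "time_to_detect": self.detected_at - self.occurred_at,
            "time_to_respond": self.responded_at - self.detected_at,
            "time_to_resolve": self.resolved_at - self.responded_at,
        }

# Made-up incident used only to exercise the record.
coe = CorrectionOfError(
    customer_impact="Queries failed on one cluster",
    occurred_at=datetime(2015, 11, 17, 3, 0),
    detected_at=datetime(2015, 11, 17, 3, 12),
    responded_at=datetime(2015, 11, 17, 3, 20),
    resolved_at=datetime(2015, 11, 17, 4, 5),
    five_whys=["Why did queries fail?"],
    actions=["Add an alarm on the failing dependency"],
)
phases = coe.timeline()
```

Tracking the phases separately shows whether the next improvement belongs in detection, response, or resolution.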
Ops is Dev
Humans are fallible
circa 1% defect injection rate
Error rate changes based on time of day (3am vs 3pm)
New people show up and have unique issues
Limit human access to machines
Use code/scripts/tools instead
Scripts are code
unit test, code review, deploy, automate
Ops load correlates to business growth
As your business does well, your operations need to become great
Growing 100-200% YoY is hard.
Improving ops 100-200% YoY is really hard.
Improving ops 2% each week is possible.
Use Pareto analysis to prioritize work
Bonus – each customer gets a better experience even as your own ops load stays constant
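The arithmetic behind the 2%-per-week figure can be checked in a few lines, assuming weekly improvements compound multiplicatively over 52 weeks. That assumption and the function name are illustrative and are not stated on the slide.

```python
def compounded_gain(weekly_rate, weeks=52):
    """Total fractional improvement after compounding weekly_rate for the given weeks."""
    return (1 + weekly_rate) ** weeks - 1

# 2% per week, compounded over one year.
annual_gain = compounded_gain(0.02)
annual_gain_percent = round(annual_gain * 100)
```

Under that assumption, 2% a week compounds to roughly 180% over a year, which falls inside the 100-200% YoY range the slide calls hard to reach in a single step.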
Amazon Redshift has grown rapidly since it became generally
available in February 2013. While our guiding principles have
served us well over the past two years, we now manage many
thousands of database instances and below offer some lessons we
have learned from operating databases at scale.
Design escalators, not elevators: Failures are common when
operating large fleets with many service dependencies. A key
lesson for us has been to design systems that degrade on failures
rather than losing outright availability. These are a common
design pattern when working with hardware failures, for example,
replicating data blocks to mask issues with disks. They are less
common when working with software or service dependencies,
though still necessary when operating in a dynamic environment.
Amazon overall (including AWS) had 50 million code
deployments over the past 12 months. Inevitably, at this scale, a
small number of regressions will occur and cause issues until
reverted. It is helpful to make one’s own service resilient to an
underlying service outage. For example, we support the ability to
preconfigure nodes in each data center, allowing us to continue to
provision and replace nodes for a period of time if there is an
Amazon EC2 provisioning interruption. One can locally increase
replication to withstand an Amazon S3 or network interruption.
We are adding similar mitigation strategies for other external
understanding that, even if not a widespread concern, each issue is
meaningful to the customer experiencing it. In Figure 5, Sev 2
refers to a severity 2 alarm that causes an engineer to get paged.
This means operational load roughly correlates to business
success. Within Amazon Redshift, we collect error logs across our
fleet and monitor tickets to understand top ten causes of error,
with the aim of extinguishing one of the top ten causes of error
each week.
Figure 5: Tickets per cluster over time
Pareto analysis is equally useful in understanding customer
functional requirements. However, it is more difficult to collect.
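The top-ten ranking of error causes described above can be sketched as a simple Pareto table: count each cause, sort by frequency, and track the cumulative share. The cause labels and the function name are hypothetical, and the real input would be whatever categories the error logs and tickets carry.

```python
from collections import Counter

def pareto(causes, top_n=10):
    """Return (cause, count, cumulative_share) rows for the most frequent causes."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows, running = [], 0
    for cause, n in counts.most_common(top_n):
        running += n
        rows.append((cause, n, running / total))
    return rows

# Hypothetical ticket causes for one week.
tickets = ["disk_full"] * 5 + ["timeout"] * 3 + ["bad_config"] * 2
rows = pareto(tickets)
```

Here the first row is `("disk_full", 5, 0.5)`, so fixing that single cause would remove half of the week's tickets.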
Escalators, not elevators
Failures happen.
Durability failures are “easy”
mirroring, quorums, well understood techniques
Availability failures are “hard”
want to degrade on unavailability, not cascade failures
tolerate 1-2 hours of unavailability (time to detect, fix)
- e.g., caching IP addresses when DNS is unavailable
- e.g., maintaining instance warm pools rather than provisioning
- e.g., losing the ability to restore a backup rather than losing writes
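The DNS example above can be sketched as a resolver that remembers the last good answer and falls back to it when a lookup fails. This is a minimal illustration under stated assumptions. The `CachingResolver` class, the injectable `lookup` parameter, and the host name are hypothetical, and the real service's mechanism is not described in the slides.

```python
import socket

class CachingResolver:
    """Resolve host names, degrading to the last known address if DNS fails."""

    def __init__(self, lookup=socket.gethostbyname):
        self._lookup = lookup      # injectable so the failure path can be exercised
        self._last_good = {}       # host -> most recent successfully resolved address

    def resolve(self, host):
        try:
            address = self._lookup(host)
        except OSError:
            # socket.gaierror is a subclass of OSError. Serve the stale entry if we have one.
            if host in self._last_good:
                return self._last_good[host]
            raise
        self._last_good[host] = address
        return address

# Hypothetical stand-in for a DNS service that fails after the first lookup.
calls = {"n": 0}

def flaky_lookup(host):
    calls["n"] += 1
    if calls["n"] > 1:
        raise OSError("DNS unavailable")
    return "10.0.0.5"

resolver = CachingResolver(lookup=flaky_lookup)
first = resolver.resolve("db.example.internal")
second = resolver.resolve("db.example.internal")  # served from the cached answer
```

A production version would likely bound how long a stale address may be served, in line with the 1-2 hour tolerance on the slide. That limit is left out here to keep the sketch short.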
Ship often
Continuous delivery should be to the customer
Benefits
Customers prefer small patches
Rollback is easier
Rollback is less likely
Faster response to customer issues
We push a new database engine version, including both features and bug fixes, every two weeks.
dependencies that can fail independently from the database itself.
Continuous delivery should be to the customer: Many
engineering organizations now use continuous build and
automated test pipelines to a releasable staging environment.
However, few actually push the release itself at a frequent pace.
While customers would prefer small patches to large ones for the
same reasons engineering organizations prefer to build and test
continuously, patching is an onerous process. This often leads to
special-case, one-off patches per customer that are limited in
scope – while necessary, they make patching yet more fragile.
Figure 4: Cumulative features deployed over time
Amazon Redshift is set up to automatically patch customer
clusters on a weekly basis in a 30-minute window specified by the

Anurag Gupta's talk on DevOps at AWS. Nov 17 at the Palo Alto AWS Big Data Meetup

  • 1. Operations @ Scale
      - Anurag Gupta, VP, AWS Database Services
  • 2. Dev/Ops: How I learned to stop worrying and love my pager
      - Dev/Ops - your dev org is your ops org
      - I get a pager! You get a pager! Everyone gets a pager!
      - Why would I possibly want this?
          - It motivates design for operability
          - It aligns your interests w/ your customer experience
          - It improves the feedback loop to customer needs
  • 3. Monitor everything
      - Every API call to your service
      - Every API call you make to a dependent service
      - Canary traffic for things that vary (e.g. SQL statements)
      - Most of the metrics won’t be meaningful. That’s OK
      - Page on your high signal-to-noise metrics
      - Monitor these metrics during deployments
      - Median/average, fleet-wide, coarse time grain are obscuring
      - Measure TP90, TP99 (99th percentile response time)
      - Measure at finer and finer grain
      - Evaluate per-customer metrics
      - Look for the needles in the haystack
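The point about medians and fleet-wide aggregates obscuring problems can be illustrated with a small sketch. This is not from the talk: the nearest-rank percentile definition, the `percentile` and `tp99_by_customer` helpers, and the latency data are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def tp99_by_customer(records):
    """Group (customer, latency_ms) records and compute TP99 per customer."""
    by_customer = defaultdict(list)
    for customer, latency_ms in records:
        by_customer[customer].append(latency_ms)
    return {c: percentile(v, 99) for c, v in by_customer.items()}


# Hypothetical data: customer "a" is healthy, customer "b" has one very slow call.
records = [("a", 20)] * 100 + [("b", 20)] * 9 + [("b", 5000)]
fleet = [latency_ms for _, latency_ms in records]

fleet_median = median(fleet)                    # looks healthy
fleet_tp99 = percentile(fleet, 99)              # still looks healthy
per_customer_tp99 = tp99_by_customer(records)   # exposes customer "b"
```

With this made-up data the fleet-wide median and TP99 are both 20 ms, while customer "b" has a TP99 of 5000 ms. The slow call only becomes visible once the metric is cut per customer, which is the "needle in the haystack" the slide refers to.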
  • 4. Correction-of-Error (COE) Reporting
      - Meet weekly on operations (execs, service operators)
      - Review each issue that happened
      - “Spin the wheel” to review a service’s metrics
      - Support a “truth-seeking” culture
      - Looking for data, process improvements
      - COE
          - Customer impact
          - Timeline: incidence to detection to response to resolution
          - 5 Whys? Get to actionable changes to extinguish cause
          - Actions
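The COE timeline (incidence to detection to response to resolution) lends itself to a simple record from which the length of each phase can be derived. The sketch below is an assumption about how one might capture it; the `CoeTimeline` class, its field names, and the timestamps are hypothetical and not an AWS format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass
class CoeTimeline:
    """Four timestamps of one incident, in the order the slide lists them."""
    incidence: datetime
    detection: datetime
    response: datetime
    resolution: datetime

    def __post_init__(self):
        stamps = [self.incidence, self.detection, self.response, self.resolution]
        if stamps != sorted(stamps):
            raise ValueError("timeline events must be in chronological order")

    def phases(self) -> Dict[str, timedelta]:
        """Duration of each phase, plus the total customer-impact window."""
        return {
            "time_to_detect": self.detection - self.incidence,
            "time_to_respond": self.response - self.detection,
            "time_to_resolve": self.resolution - self.response,
            "total_impact": self.resolution - self.incidence,
        }


# Hypothetical incident, for illustration only.
example = CoeTimeline(
    incidence=datetime(2015, 11, 17, 3, 0),
    detection=datetime(2015, 11, 17, 3, 12),
    response=datetime(2015, 11, 17, 3, 20),
    resolution=datetime(2015, 11, 17, 4, 5),
)
example_phases = example.phases()
```

Splitting the timeline this way shows where the time went: a long `time_to_detect` points at monitoring gaps, while a long `time_to_resolve` points at tooling or runbooks.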
  • 5. Ops is Dev
      - Humans are fallible
          - circa 1% defect injection rate
          - Error rate changes based on time of day (3am vs 3pm)
          - New ones show up, have unique issues
      - Limit human access to machines
          - Use code/scripts/tools instead
      - Scripts are code
          - unit test, code review, deploy, automate
  • 6. Ops load correlates to business growth
      - As your business does well, your operations needs to become great
      - Growing 100-200% YoY is hard.
      - Improving ops 100-200% YoY is really hard.
      - Improving ops 2% each week is possible.
      - Use Pareto analysis to prioritize work
      - Bonus – each customer gets a better experience even as your own ops load stays constant

      Amazon Redshift has grown rapidly since it became generally available in February 2013. While our guiding principles have served us well over the past two years, we now manage many thousands of database instances and below offer some lessons we have learned from operating databases at scale.

      Design escalators, not elevators: Failures are common when operating large fleets with many service dependencies. A key lesson for us has been to design systems that degrade on failures rather than losing outright availability. These are a common design pattern when working with hardware failures, for example, replicating data blocks to mask issues with disks. They are less common when working with software or service dependencies, though still necessary when operating in a dynamic environment. Amazon overall (including AWS) had 50 million code deployments over the past 12 months. Inevitably, at this scale, a small number of regressions will occur and cause issues until reverted. It is helpful to make one’s own service resilient to an underlying service outage. For example, we support the ability to preconfigure nodes in each data center, allowing us to continue to provision and replace nodes for a period of time if there is an Amazon EC2 provisioning interruption. One can locally increase replication to withstand an Amazon S3 or network interruption. We are adding similar mitigation strategies for other external

      understanding that, even if not a widespread concern, each issue is meaningful to the customer experiencing it. In Figure 5, Sev 2 refers to a severity 2 alarm that causes an engineer to get paged. This means operational load roughly correlates to business success. Within Amazon Redshift, we collect error logs across our fleet and monitor tickets to understand top ten causes of error, with the aim of extinguishing one of the top ten causes of error each week.

      Figure 5: Tickets per cluster over time

      Pareto analysis is equally useful in understanding customer functional requirements. However, it is more difficult to collect.
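As a rough illustration of the two ideas above, the sketch below ranks ticket causes Pareto-style and works out what a 2% weekly improvement amounts to over a year. The ticket labels and counts are invented, and treating "2% each week" as compounding multiplicatively is an assumption about what the slide means.

```python
from collections import Counter


def pareto(causes, top_n=10):
    """Return (cause, count, cumulative_share) rows for the top_n most frequent causes."""
    counts = Counter(causes)
    total = sum(counts.values())
    rows = []
    cumulative = 0
    for cause, n in counts.most_common(top_n):
        cumulative += n
        rows.append((cause, n, cumulative / total))
    return rows


# Hypothetical ticket causes; real input would come from error logs and tickets.
tickets = (
    ["disk_full"] * 50
    + ["query_timeout"] * 30
    + ["node_replace"] * 15
    + ["dns"] * 5
)
ranking = pareto(tickets)

# Assumption: a 2% weekly improvement compounds over 52 weeks.
yearly_factor = 1.02 ** 52
```

In this invented data set the top two causes account for 80% of tickets, so eliminating one top cause per week attacks most of the load first. Under the compounding assumption, `yearly_factor` comes out at roughly 2.8, which is how small weekly gains can keep pace with 100-200% yearly growth.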
  • 7. Escalators, not elevators
      - Failures happen.
      - Durability failures are “easy”
          - mirroring, quorums, well understood techniques
      - Availability failures are “hard”
          - want to degrade on unavailability, not cascade failures
          - tolerate 1-2 hours of unavailability (time to detect, fix)
          - e.g. caching IP addresses when DNS is unavailable
          - e.g. maintaining instance warm pools rather than provisioning
          - e.g. losing the ability to restore a backup, not lose writes
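The "caching IP addresses when DNS is unavailable" example can be sketched as a resolver that falls back to its last known good answer. This is a minimal illustration, not how Redshift does it: the `CachingResolver` class, the two-hour staleness limit (taken from the "1-2 hours" on the slide), and the simulated lookup are all assumptions.

```python
import socket
import time


class CachingResolver:
    """Resolve hostnames, serving the last good answer while DNS is unavailable."""

    def __init__(self, lookup=socket.gethostbyname,
                 max_stale_seconds=2 * 3600, clock=time.monotonic):
        self._lookup = lookup
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._last_good = {}  # host -> (ip, time the answer was obtained)

    def resolve(self, host):
        try:
            ip = self._lookup(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            cached = self._last_good.get(host)
            if cached is not None and self._clock() - cached[1] <= self._max_stale:
                return cached[0]  # degrade: possibly stale, but still usable
            raise  # nothing cached, or too stale to trust
        self._last_good[host] = (ip, self._clock())
        return ip


def make_flaky_lookup(ip="10.0.0.5"):
    """Hypothetical stand-in for DNS that can be switched off via state['up']."""
    state = {"up": True}

    def lookup(host):
        if not state["up"]:
            raise socket.gaierror("simulated DNS outage")
        return ip

    return lookup, state
```

The trade-off is that a cached address may be wrong if the target moved during the outage, so the staleness limit bounds how long the service keeps running on old answers before failing outright.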
  • 8. Ship often
      - Continuous delivery should be to the customer
      - Benefits
          - Customers prefer small patches
          - Rollback is easier
          - Rollback is less likely
          - Faster response to customer issues
      - We push a new database engine version, including both features and bug fixes, every two weeks.

      dependencies that can fail independently from the database itself.

      Continuous delivery should be to the customer: Many engineering organizations now use continuous build and automated test pipelines to a releasable staging environment. However, few actually push the release itself at a frequent pace. While customers would prefer small patches to large ones for the same reasons engineering organizations prefer to build and test continuously, patching is an onerous process. This often leads to special-case, one-off patches per customer that are limited in scope – while necessary, they make patching yet more fragile.

      Figure 4: Cumulative features deployed over time

      Amazon Redshift is set up to automatically patch customer clusters on a weekly basis in a 30-minute window specified by the