S U M M I T
SYDNEY
© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.S U M M I T
The theory and practice, practice,
practice of AWS Operations
Colm MacCárthaigh
Senior Principal Engineer
Amazon Web Services
Do you carry a pager?
Industry-wide recovery rate
What is operations?
Operations is the doing of things
Over and over, better and better
How do we form good habits, and prevent bad habits?
What is operations?
Great operations is built on humility
We build robust systems and designs, but anticipate that
there will be something that we didn’t think of
Healthy paranoia helps too
What is operations?
Every operational action is reviewed
“Two Person Rule” for anything that is not routine
Constantly strive to automate routine actions away
What are we going to cover?
What is different at scale?
What is operational risk?
Compartmentalisation
Deployment Safety
The Operational Mindset
Staying SAFE when things go wrong
What is different at scale?
Something is “broken” all of the time
The stakes are higher
The number of people involved is larger
What is different at scale?
There are greater opportunities to perfect
automation and operational practices
There is more experience to learn from
The number of people involved is larger
How we think about operational risk
Operational risk is really related to change
• Key types of change
• Component failure
• Environmental failure
• Increases in load
• New code paths being exercised
• New modes of operation
Compartmentalisation
Regions: ap-southeast-2, us-east-2, eu-west-1, …
Compartmentalisation
Availability Zones:
ap-southeast-2a, ap-southeast-2b, ap-southeast-2c
us-east-2a, us-east-2b, us-east-2c
eu-west-1a, eu-west-1b, eu-west-1c
…
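The effect of compartmentalisation on blast radius can be put into numbers with a small sketch. It uses only the Region and Availability Zone labels shown on the slide and assumes, purely for illustration, that every zone carries an equal share of capacity. The `REGIONS` table and the `blast_radius` function are hypothetical names, not anything from AWS.

```python
from fractions import Fraction

# Labels taken from the slide. Equal capacity per Availability Zone is an
# assumption made only to keep the arithmetic simple.
REGIONS = {
    "ap-southeast-2": ["ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c"],
    "us-east-2": ["us-east-2a", "us-east-2b", "us-east-2c"],
    "eu-west-1": ["eu-west-1a", "eu-west-1b", "eu-west-1c"],
}


def blast_radius(scope: str) -> Fraction:
    """Fraction of all zones affected by a fault contained to `scope`.

    `scope` is "global", a Region name, or an Availability Zone name.
    """
    all_zones = [zone for zones in REGIONS.values() for zone in zones]
    if scope == "global":
        affected = all_zones
    elif scope in REGIONS:
        affected = REGIONS[scope]
    elif scope in all_zones:
        affected = [scope]
    else:
        raise ValueError(f"unknown scope: {scope}")
    return Fraction(len(affected), len(all_zones))
```

With this toy layout, a fault contained to a single Availability Zone touches one ninth of the zones, a fault contained to a Region touches one third, and only an uncompartmentalised fault touches everything.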
Deployment Safety
New code means risk, so we are incredibly
paranoid about deploying it
CI/CD staged deployment process
Promotion testing and monitoring at every
stage, with automated rollback
Fast and reliable rollback
Deployment Safety
1. Code-review
2. Check-in
3. Pre-Production
4. One Box
5. One Availability Zone
6. One Region
7. Onwards …
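The staged promotion listed above, combined with automated rollback, can be sketched roughly as follows. This is a minimal illustration and not AWS's deployment tooling: the `deploy`, `is_healthy` and `rollback` callables are hypothetical hooks supplied by the caller, and only the stage names come from the slide.

```python
from typing import Callable, List, Sequence

# Stage names follow the slide; everything else is a hypothetical sketch.
STAGES = ("Pre-Production", "One Box", "One Availability Zone", "One Region")


def staged_deploy(
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[str] = STAGES,
) -> bool:
    """Promote a change stage by stage; undo every stage on the first failure."""
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if not is_healthy(stage):
            # Roll back in reverse order of deployment, then stop promoting.
            for deployed in reversed(done):
                rollback(deployed)
            return False
    return True
```

The design choice worth noting is that the health check gates every promotion, so a bad change never reaches a wider stage than the one in which it first looks unhealthy.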
Deployment Safety
Reliable and fast rollback is key
Random Selection is useful to avoid repeat issues
Services have to be designed with phased deployment
and rollback in mind
What if we’re making backwards-incompatible changes?
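One reading of "Random Selection" above is that the box, Availability Zone or Region that receives a change first is picked at random, so the same targets are not always first in line for a bad deployment. That reading is an assumption, since the slide does not spell it out. A sketch of it, with the hypothetical helper `deployment_order`:

```python
import random
from typing import List, Optional, Sequence


def deployment_order(
    targets: Sequence[str], rng: Optional[random.Random] = None
) -> List[str]:
    """Return the targets in a freshly shuffled order for one deployment."""
    if rng is None:
        rng = random.Random()
    order = list(targets)  # copy, so the caller's sequence is left untouched
    rng.shuffle(order)
    return order
```

Passing a seeded `random.Random` makes the order reproducible, which is convenient when testing the surrounding pipeline.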
The operational mindset
Services are not simply code in an editor
Services are live running systems that respond to input
Services change over time even if you don’t change the code
There is no “done”
The operational mindset
Every week at AWS we have an ‘All Hands’ operations meeting
We dig into any COEs (Corrections of Error) from the previous week
We choose some services at random and dive into their operational
metrics
We look at operational sustainability too
The operational mindset
Every team has a corresponding weekly operations meeting:
review the previous week’s metrics
We pay close attention to any alarms that fired, and any
tickets that were cut
On-call report from the engineers who were on-call
Staying SAFE when things go wrong
Every team has an active on-call engineer
On-call engineers are automatically engaged for most issues
CloudWatch Alarms -> tracking ticket -> page
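The alarm, ticket and page flow above can be modelled in a few lines. This sketch does not call CloudWatch or any real ticketing or paging system. `Ticket`, `handle_alarm` and the `on_call` roster are hypothetical stand-ins that only show the order of the steps.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Ticket:
    """Hypothetical tracking ticket opened for a firing alarm."""
    alarm_name: str
    team: str
    pages_sent: List[str] = field(default_factory=list)


def handle_alarm(alarm_name: str, team: str, on_call: Dict[str, str]) -> Ticket:
    """Open a tracking ticket for a firing alarm, then page the team's on-call."""
    ticket = Ticket(alarm_name=alarm_name, team=team)
    engineer = on_call[team]  # a KeyError here means the team has no on-call
    ticket.pages_sent.append(engineer)
    return ticket
```

For example, `handle_alarm("HighErrorRate", "storage", {"storage": "alice"})` returns a ticket whose `pages_sent` list contains `"alice"`.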
Staying SAFE when things go wrong
For larger issues, or when there is elevated risk, we use voice conference
calls to coordinate
Every call has a designated “Call Leader” and a “facilitator” from our
technical operations team
Call leaders are experienced and tenured AWS staff and are empowered
to make decisions
When
doesn’t apply
The Theory and Practice, Practice,
Practice of AWS Operations
Stay calm
Assess the situation
Focus on mitigation
Escalate early and often
Stay calm
Big events can be overwhelming, but cooler heads prevail. Keep a sense
of urgency, but be methodical and avoid a frenetic energy. Most of all,
don’t panic! Remember that we have a 100% success rate of mitigating
operational events, so take a quick breath if you need to; we will solve
each and every challenge. Try to avoid team huddles that are not on a
conference call with an experienced designated leader, as they split focus
and detract from resolution. Call leaders can help you prioritise and
engage others. Use the ticket and Chime or IRC for communication too.
Assess the situation
If you join a call, first read the associated ticket or SIM, especially the
summary. If the call has a large number of participants then please don’t
interrupt it to check in; use the ticket/SIM instead. Assess your own
dashboards, alarms and alerts and be prepared to report the summary
for your own area, service or team. Do this continuously for the duration
of the call. Every call participant should be active. If you can, take notes
for yourself as you go: the names of other people on the call, key pieces
of technology or information you might not be familiar with, and so on.
Focus on mitigation, not root cause analysis
Most issues can be remediated much more quickly with rollbacks, data
center flips, database failover, throttling and other operational actions
that essentially quench the source of the problem rather than truly
“fixing” it in a deeper sense (i.e. patching the code). At Amazon we are
happy to roll back speculatively, on the basis that it might fix the
problem. More often than not, it is not necessary to truly understand the
problem, so defer root cause analysis until you have exhausted the
traditional rollback/flipping steps for your service, or until you can
do remediation actions and root cause analysis in parallel.
Escalate early and often
The absolute minimum knowledge for every on-call engineer is to know
how to escalate: be able to page your secondary, your manager, and any
other escalation channels appropriate for your service. It is always ok to
escalate! If you are yourself a secondary or an escalation contact, be
grateful and supportive when you get paged. If an LSE (large-scale event)
is impacting your service, and recovery steps are likely, escalate early
and get help!
Paging early has been shown to save tens of minutes of impact time
during typical events.
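Knowing how to escalate can be as simple as having the chain written down and walkable. The sketch below pages contacts in the order the slide lists them (secondary, manager, other channels) and stops at the first acknowledgement. The sequential order, the `escalate` function and the `page` callback are all illustrative assumptions. In practice several contacts may be paged at once.

```python
from typing import Callable, Optional, Sequence


def escalate(chain: Sequence[str], page: Callable[[str], bool]) -> Optional[str]:
    """Page each contact in order until one acknowledges.

    `page` is a hypothetical hook that returns True when the contact
    acknowledges. Returns the acknowledging contact, or None if nobody did.
    """
    for contact in chain:
        if page(contact):
            return contact
    return None
```

A `None` result is itself a signal: the chain is exhausted and a wider escalation channel is needed.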
Takeaways
• At AWS, teams own their business, development, and operations
• We think of operational risk in terms of change
• Compartmentalisation limits the blast radius of issues
• Steady deployment safety wrangles risks associated with software and configuration changes
• AWS company culture values operational excellence
• SAFE when things go wrong
Thank you!
Colm MacCárthaigh
colmmacc@amazon.com

More Related Content

What's hot

Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Amazon Web Services
 
What is a Bot and why you should care
What is a Bot and why you should careWhat is a Bot and why you should care
What is a Bot and why you should care
Elisabeth Bitsch-Christensen
 
AWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
AWS Summit Singapore 2019 | Transformation Towards a Digital Native EnterpriseAWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
AWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
AWS Summits
 
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
Amazon Web Services
 
Keynotes Akamai Trust No One City Tour
Keynotes Akamai Trust No One City TourKeynotes Akamai Trust No One City Tour
Keynotes Akamai Trust No One City Tour
Elisabeth Bitsch-Christensen
 
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit SydneyInnovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Amazon Web Services
 
Accelerated Transformation through Training
Accelerated Transformation through TrainingAccelerated Transformation through Training
Accelerated Transformation through Training
Amazon Web Services
 
AWS Sydney Summit 2019 Re:Cap
AWS Sydney Summit 2019 Re:CapAWS Sydney Summit 2019 Re:Cap
AWS Sydney Summit 2019 Re:Cap
Injae Kwak
 
How to Counter Cybersecurity Attacks - Trust No One
How to Counter Cybersecurity Attacks - Trust No OneHow to Counter Cybersecurity Attacks - Trust No One
How to Counter Cybersecurity Attacks - Trust No One
Elisabeth Bitsch-Christensen
 
Developing and Teaching Robotics with AWS RoboMaker
Developing and Teaching Robotics with AWS RoboMaker Developing and Teaching Robotics with AWS RoboMaker
Developing and Teaching Robotics with AWS RoboMaker
Amazon Web Services
 
Building a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWSBuilding a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWS
Injae Kwak
 
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWSAWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
AWS Summits
 
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI servicesAWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
Amazon Web Services
 
Infrastructure Asset Management – Making Everything as Simple as Possible but...
Infrastructure Asset Management – Making Everything as Simple as Possible but...Infrastructure Asset Management – Making Everything as Simple as Possible but...
Infrastructure Asset Management – Making Everything as Simple as Possible but...
Waugh Infrastructure Management Ltd.
 
Scale as an Enabler for Security
Scale as an Enabler for SecurityScale as an Enabler for Security
Scale as an Enabler for Security
scoopnewsgroup
 
Building a Better Scala Community
Building a Better Scala CommunityBuilding a Better Scala Community
Building a Better Scala Community
Kelley Robinson
 

What's hot (17)

Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
Developing intelligent robots with AWS RoboMaker - SVC207 - Santa Clara AWS S...
 
AWS Startup Day- softchef
AWS Startup Day- softchefAWS Startup Day- softchef
AWS Startup Day- softchef
 
What is a Bot and why you should care
What is a Bot and why you should careWhat is a Bot and why you should care
What is a Bot and why you should care
 
AWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
AWS Summit Singapore 2019 | Transformation Towards a Digital Native EnterpriseAWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
AWS Summit Singapore 2019 | Transformation Towards a Digital Native Enterprise
 
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
Prepare For The Next Phase of Your AWS Journey With CloudHealth (Session spon...
 
Keynotes Akamai Trust No One City Tour
Keynotes Akamai Trust No One City TourKeynotes Akamai Trust No One City Tour
Keynotes Akamai Trust No One City Tour
 
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit SydneyInnovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
Innovating at Scale – Lessons Learned Growing Alexa - AWS Summit Sydney
 
Accelerated Transformation through Training
Accelerated Transformation through TrainingAccelerated Transformation through Training
Accelerated Transformation through Training
 
AWS Sydney Summit 2019 Re:Cap
AWS Sydney Summit 2019 Re:CapAWS Sydney Summit 2019 Re:Cap
AWS Sydney Summit 2019 Re:Cap
 
How to Counter Cybersecurity Attacks - Trust No One
How to Counter Cybersecurity Attacks - Trust No OneHow to Counter Cybersecurity Attacks - Trust No One
How to Counter Cybersecurity Attacks - Trust No One
 
Developing and Teaching Robotics with AWS RoboMaker
Developing and Teaching Robotics with AWS RoboMaker Developing and Teaching Robotics with AWS RoboMaker
Developing and Teaching Robotics with AWS RoboMaker
 
Building a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWSBuilding a Real-Time Data Platform on AWS
Building a Real-Time Data Platform on AWS
 
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWSAWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
AWS Summit Singapore 2019 | Driving Business Outcomes with Data Lake on AWS
 
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI servicesAWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
AWS Summit Singapore 2019 | Accelerating ML Adoption with Our New AI services
 
Infrastructure Asset Management – Making Everything as Simple as Possible but...
Infrastructure Asset Management – Making Everything as Simple as Possible but...Infrastructure Asset Management – Making Everything as Simple as Possible but...
Infrastructure Asset Management – Making Everything as Simple as Possible but...
 
Scale as an Enabler for Security
Scale as an Enabler for SecurityScale as an Enabler for Security
Scale as an Enabler for Security
 
Building a Better Scala Community
Building a Better Scala CommunityBuilding a Better Scala Community
Building a Better Scala Community
 

Similar to The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sydney

Automated Security Remediation
Automated Security RemediationAutomated Security Remediation
Automated Security Remediation
Amazon Web Services
 
Are you Well Architected?
Are you Well Architected?Are you Well Architected?
Are you Well Architected?
Amazon Web Services
 
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
Amazon Web Services
 
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
Amazon Web Services
 
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
Amazon Web Services Korea
 
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS SummitThreat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Amazon Web Services
 
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
Amazon Web Services
 
Operando em Escala Preparando-se para a jornada
Operando em EscalaPreparando-se para a jornadaOperando em EscalaPreparando-se para a jornada
Operando em Escala Preparando-se para a jornada
Amazon Web Services LATAM
 
Unified monitoring of the container environment, containers, and applications...
Unified monitoring of the container environment, containers, and applications...Unified monitoring of the container environment, containers, and applications...
Unified monitoring of the container environment, containers, and applications...
Amazon Web Services
 
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS SummitIntroduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Amazon Web Services
 
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
Amazon Web Services
 
Becoming A High Frequency Enterprise
Becoming A High Frequency EnterpriseBecoming A High Frequency Enterprise
Becoming A High Frequency EnterpriseAmazon Web Services
 
Keynote_AWS_BecomingAHighFrequencyEnterprise
Keynote_AWS_BecomingAHighFrequencyEnterpriseKeynote_AWS_BecomingAHighFrequencyEnterprise
Keynote_AWS_BecomingAHighFrequencyEnterprise
Amazon Web Services
 
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit SydneyCloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
Amazon Web Services
 
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
Amazon Web Services Korea
 
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019 How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
Amazon Web Services
 
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
Amazon Web Services
 
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMSLeaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
Amazon Web Services
 
Ramping up on AWS
Ramping up on AWSRamping up on AWS
Ramping up on AWS
Amazon Web Services
 
Rapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit SydneyRapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit Sydney
Amazon Web Services
 

Similar to The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sydney (20)

Automated Security Remediation
Automated Security RemediationAutomated Security Remediation
Automated Security Remediation
 
Are you Well Architected?
Are you Well Architected?Are you Well Architected?
Are you Well Architected?
 
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
Introduction to the Well-Architected Framework and Tool - SVC212 - Santa Clar...
 
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
Transform with Cloud to drive your Future | AWS Summit Tel Aviv 2019
 
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
클라우드 세상에서 CIO로 살아남기 - 이한주 대표이사, Bespin Global :: AWS Summit Seoul 2019
 
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS SummitThreat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
Threat detection and mitigation at AWS - SEC201 - Atlanta AWS Summit
 
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
The Zen of governance - Establish guardrails and empower builders - SVC201 - ...
 
Operando em Escala Preparando-se para a jornada
Operando em EscalaPreparando-se para a jornadaOperando em EscalaPreparando-se para a jornada
Operando em Escala Preparando-se para a jornada
 
Unified monitoring of the container environment, containers, and applications...
Unified monitoring of the container environment, containers, and applications...Unified monitoring of the container environment, containers, and applications...
Unified monitoring of the container environment, containers, and applications...
 
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS SummitIntroduction to AWS Global Accelerator - SVC212 - New York AWS Summit
Introduction to AWS Global Accelerator - SVC212 - New York AWS Summit
 
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
AWS及客戶在AI/ML的數位運行過程中得到的重要經驗與學習
 
Becoming A High Frequency Enterprise
Becoming A High Frequency EnterpriseBecoming A High Frequency Enterprise
Becoming A High Frequency Enterprise
 
Keynote_AWS_BecomingAHighFrequencyEnterprise
Keynote_AWS_BecomingAHighFrequencyEnterpriseKeynote_AWS_BecomingAHighFrequencyEnterprise
Keynote_AWS_BecomingAHighFrequencyEnterprise
 
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit SydneyCloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
Cloud Operating Models for Accelerated Cloud Transformation - AWS Summit Sydney
 
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
성장하는 스타트업을 위한 아마존 이야기: Lean Innovation and Culture - Gaurav Arora, APAC 스타트업 ...
 
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019 How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
How Pokémon’s SecOps team enables its business - SDD328 - AWS re:Inforce 2019
 
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
Automate Security Event Management Using Trust-Based Decision Models - AWS Su...
 
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMSLeaping Over the Skills Gap - Accelerate Your Journey with AMS
Leaping Over the Skills Gap - Accelerate Your Journey with AMS
 
Ramping up on AWS
Ramping up on AWSRamping up on AWS
Ramping up on AWS
 
Rapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit SydneyRapid Prototyping with AWS - AWS Summit Sydney
Rapid Prototyping with AWS - AWS Summit Sydney
 

The Theory and Practice, Practice, Practice of AWS Operations - AWS Summit Sydney

  • 1. SUMMIT SYDNEY
  • 2. The theory and practice, practice, practice of AWS Operations. Colm MacCárthaigh, Senior Principal Engineer, Amazon Web Services. © 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  • 3. Do you carry a pager?
  • 4. Industry-wide recovery rate
  • 6. What is operations?
  • 7. What is operations? Operations is the doing of things, over and over, better and better. How do we form good habits, and prevent bad habits?
  • 9. What is operations? Great operations is built on humility. We build robust systems and designs, but anticipate that there will be something that we didn’t think of. Healthy paranoia helps too.
  • 10. What is operations? Every operational action is reviewed. “Two Person Rule” for anything that is not routine. Constantly strive to automate routine actions away.
  • 11. What are we going to cover? What is different at scale? What is operational risk? Compartmentalisation. Deployment Safety. The Operational Mindset. Staying SAFE when things go wrong.
  • 13. What is different at scale? Something is “broken” all of the time. The stakes are higher. The number of people involved is larger.
  • 14. What is different at scale? There are greater opportunities to perfect automation and operational practices. There is more experience to learn from. The number of people involved is larger.
  • 16-21. How we think about operational risk. Operational risk is really related to change. Key types of change: component failure; environmental failure; increases in load; new code paths being exercised; new modes of operation.
  • 23-28. Compartmentalisation
  • 29. Compartmentalisation, by Region: ap-southeast-2, us-east-2, eu-west-1, …
  • 30-33. Compartmentalisation, by Availability Zone: ap-southeast-2a, ap-southeast-2b, ap-southeast-2c; us-east-2a, us-east-2b, us-east-2c; eu-west-1a, eu-west-1b, eu-west-1c; …
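
The Region and Availability Zone names on the compartmentalisation slides can be grouped to show how each compartment bounds the impact of a failure. The following is a minimal sketch, not something taken from the talk: the `blast_radius` helper is hypothetical, and it assumes capacity is spread evenly across the listed zones.

```python
from collections import defaultdict

# Availability Zone names as they appear on the slides.
ZONES = [
    "ap-southeast-2a", "ap-southeast-2b", "ap-southeast-2c",
    "us-east-2a", "us-east-2b", "us-east-2c",
    "eu-west-1a", "eu-west-1b", "eu-west-1c",
]


def region_of(zone: str) -> str:
    """An Availability Zone name is its Region name plus a one-letter suffix."""
    return zone[:-1]


def group_by_region(zones):
    """Map each Region to the sorted list of its Availability Zones."""
    grouped = defaultdict(list)
    for zone in zones:
        grouped[region_of(zone)].append(zone)
    return {region: sorted(members) for region, members in grouped.items()}


def blast_radius(failed: str, zones) -> float:
    """Fraction of zones lost when `failed` (a Region or a zone name) goes down.

    Assumption for illustration only: every zone carries equal capacity.
    """
    affected = [z for z in zones if z == failed or region_of(z) == failed]
    return len(affected) / len(zones)
```

Under that even-spread assumption, losing a single zone touches one ninth of the fleet, while losing a whole Region touches one third, which is the sense in which a smaller compartment means a smaller blast radius.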
  • 35. Deployment Safety
  • 36. Deployment Safety. New code means risk, so we are incredibly paranoid about deploying it. CI/CD staged deployment process. Promotion testing and monitoring at every stage, with automated rollback. Fast and reliable rollback.
  • 37. Deployment Safety. 1. Code review 2. Check-in 3. Pre-production 4. One box 5. One Availability Zone 6. One Region 7. Onwards …
  • 38. Deployment Safety. Reliable and fast rollback is key. Random selection is useful to avoid repeat issues. Services have to be designed with phased deployment and rollback in mind. What if we’re making backwards-incompatible changes?
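
The staged process on the deployment safety slides (promotion through progressively larger stages, a health check at each one, automated rollback on failure) can be sketched roughly as below. This is an illustrative outline under stated assumptions, not AWS's actual pipeline: `staged_deploy`, `pick_one_box`, and the `deploy`, `is_healthy` and `rollback` callbacks are hypothetical names introduced here.

```python
import random

# Post-check-in stages from the slide; "Onwards …" is left out on purpose.
STAGES = ["pre-production", "one box", "one Availability Zone", "one Region"]


def staged_deploy(version, stages, deploy, is_healthy, rollback):
    """Promote `version` one stage at a time.

    After each stage a health check runs. On the first failure, every stage
    touched so far is rolled back (most recent first) and promotion stops.
    Returns (succeeded, stages_touched).
    """
    touched = []
    for stage in stages:
        deploy(stage, version)
        touched.append(stage)
        if not is_healthy(stage):
            for done in reversed(touched):
                rollback(done)
            return False, touched
    return True, touched


def pick_one_box(hosts, rng=random):
    """Choose the "one box" host at random, so that successive deployments
    do not keep exercising the same machine."""
    return rng.choice(hosts)
```

The caller supplies the three callbacks, so the same loop works whether a "stage" is a single host or a whole Region. Rolling back in reverse order is a design choice made for this sketch, not something the slides specify.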
  • 41. The operational mindset. Services are not simply code in an editor. Services are live running systems that respond to input. Services change over time even if you don’t change the code. There is no “done”.
  • 42. The operational mindset. Every week at AWS we have an ‘All Hands’ operations meeting. We dig into any COEs (Corrections of Error) from the previous week. We choose some services at random and dive into their operational metrics. We look at operational sustainability too.
  • 43. The operational mindset
  • 44. The operational mindset. Every team has a corresponding weekly operations meeting to review the previous week’s metrics. We pay close attention to any alarms that fired, and any tickets that were cut. On-call report from the engineers who were on-call.
  • 46. Staying SAFE when things go wrong. Every team has an active on-call engineer. On-call engineers are automatically engaged for most issues. CloudWatch Alarms -> tracking ticket -> page.
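
The "CloudWatch Alarms -> tracking ticket -> page" flow can be pictured as a small routing step: each alarm opens a ticket, and the ticket pages whoever is on call for the affected service. The sketch below is purely illustrative and does not use any real CloudWatch or ticketing API; `Ticket`, `AlarmRouter` and the on-call roster are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Ticket:
    """A tracking ticket opened for one alarm (hypothetical structure)."""
    ticket_id: int
    service: str
    summary: str
    pages: list = field(default_factory=list)  # engineers paged so far


class AlarmRouter:
    """Turns an alarm into a ticket, then pages the service's on-call engineer."""

    def __init__(self, on_call):
        self.on_call = dict(on_call)  # service name -> on-call engineer
        self.tickets = []
        self._ids = count(1)

    def handle_alarm(self, service, summary):
        # Step 1: record the alarm as a tracking ticket.
        ticket = Ticket(next(self._ids), service, summary)
        self.tickets.append(ticket)
        # Step 2: engage the on-call engineer automatically.
        ticket.pages.append(self.on_call[service])
        return ticket
```

Opening the ticket before paging means the engineer who is woken up always has a written record to read first, which matches the later advice to start from the ticket summary.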
  • 47. Staying SAFE when things go wrong. For larger issues, or when there is elevated risk, we use voice conference calls to coordinate. Every call has a designated “Call Leader” and a “facilitator” from our technical operations team. Call leaders are experienced and tenured AWS staff and are empowered to make decisions.
  • 48. When doesn’t apply
  • 49. The Theory and Practice, Practice, Practice of AWS Operations. Stay calm. Assess the situation. Focus on mitigation. Escalate early and often.
  • 50. Stay calm. Big events can be overwhelming, but cooler heads prevail. Keep a sense of urgency, but be methodical and avoid a frenetic energy. Most of all, don’t panic! Remember that we have a 100% success rate of mitigating operational events, so take a quick breath if you need to; we will solve each and every challenge. Try to avoid team huddles that are not on a conference call with an experienced designated leader, as they split focus and detract from resolution. Call leaders can help you prioritise and engage others. Use the ticket and Chime or IRC for communication too.
  • 51. Assess the situation. If you join a call, first read the associated ticket or SIM, especially the summary. If the call has a large number of participants then please don’t interrupt it to check in; instead use the ticket/SIM. Assess your own dashboards, alarms and alerts and be prepared to report the summary for your own area, service or team. Do this continuously for the duration of the call. Every call participant should be active. If you can, take notes for yourself as you go: the names of other people on the call, key pieces of technology or information you might not be familiar with, etc …
  • 52. Focus on mitigation, not root cause analysis. Most issues can be remediated much more quickly with rollbacks, data center flips, database failover, throttling and other operational actions that essentially quench the source of the problem rather than truly “fixing” it in a deeper sense (i.e. patching the code). At Amazon we are happy to roll back speculatively, on the basis that it might fix the problem. More often than not, it is not necessary to truly understand the problem, so defer root cause analysis until you have either exhausted traditional rollback/flipping steps for your service, or until you can do remediation actions and root cause analysis in parallel.
  • 53. Escalate early and often. The absolute minimum knowledge for every on-call engineer is to know how to escalate: be able to page your secondary, your manager, and any other escalation channels appropriate for your service. It is always OK to escalate! If you yourself are a secondary or an escalation person, be grateful and supportive when you get paged. If an LSE is impacting your service, and if recovery steps are likely, escalate early and get help! Paging early has been shown to save tens of minutes of impact time during typical events.
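
The escalation order named above (secondary, then manager, then any further channel) can be written down as a chain that widens the longer an issue goes unacknowledged. This is a rough sketch only: the five-minute step and the `who_to_page` helper are assumptions made for illustration, not values or tooling from the talk.

```python
def who_to_page(chain, minutes_without_ack, step_minutes=5):
    """Return everyone who should have been paged by now.

    `chain` is ordered from the first responder outward, for example
    ["primary", "secondary", "manager"]. One more level is paged for every
    `step_minutes` that pass without acknowledgement (a hypothetical policy).
    """
    levels = min(len(chain), 1 + minutes_without_ack // step_minutes)
    return chain[:levels]
```

For example, with the chain `["primary", "secondary", "manager"]`, the primary is paged immediately, the secondary joins after five unacknowledged minutes, and the manager after ten. A short step reflects the "early and often" advice; the exact interval would be a per-service decision.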
  • 54. Takeaways. At AWS, teams own their business, development, and operations. We think of operational risk in terms of change. Compartmentalisation limits the blast radius of issues. Steady deployment safety wrangles risks associated with software and configuration changes. AWS company culture values operational excellence. Stay SAFE when things go wrong.
  • 55. Thank you! Colm MacCárthaigh, colmmacc@amazon.com