Operations: Production Readiness Review – How to Stop Bad Things from Happening

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chris Munns
Fall 2017
AWS Startup Day
Production Readiness Review

About me:
Chris Munns - munns@amazon.com, @chrismunns
• Senior Developer Advocate - Serverless
• New Yorker
• Previously:
• AWS Business Development Manager – DevOps, July ’15 - Feb ’17
• AWS Solutions Architect, Nov 2011 - Dec 2014
• Formerly on operations teams @Etsy and @Meetup
• Little time at a hedge fund, Xerox and a few other startups
• Rochester Institute of Technology: Applied Networking and
Systems Administration ’05
• Internet infrastructure geek

“Everything fails all the time.”
Werner Vogels, CTO, Amazon.com

You don’t need all of these from day one, grow them as your teams grow.
Architecture Design Review
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy

Netflix Chaos Engineering
1. Define the system’s normal behavior — its “steady state” — based on
measurable output like overall throughput, error rates, latency, etc.
2. Hypothesize about the steady state behavior of an experimental group, as
compared to a stable control group.
3. Expose the experimental group to simulated real-world events such as server
crashes, malformed responses, or traffic spikes.
4. Test the hypothesis by comparing the steady state of the control group and
the experimental group. The smaller the differences, the more confidence we
have that the system is resilient.
TL;DR: Intentionally break things, compare the measured impact with the expected impact, and correct any problems this uncovers.
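A minimal sketch of step 4, comparing the steady state of a control group and an experimental group. The function name, the 5% tolerance, and the sample numbers are assumptions made for illustration; they are not taken from Netflix tooling.

```python
from statistics import mean


def steady_state_holds(control, experiment, tolerance=0.05):
    """Return True if the experiment's mean metric stays within
    `tolerance` (relative difference) of the control group's mean."""
    baseline = mean(control)
    if baseline == 0:
        return mean(experiment) == 0
    deviation = abs(mean(experiment) - baseline) / baseline
    return deviation <= tolerance


# Hypothetical requests-per-second samples gathered during an experiment.
control_rps = [1000, 1010, 990, 1005]
experiment_rps = [980, 1000, 975, 990]  # group with a simulated server crash

print(steady_state_holds(control_rps, experiment_rps))
```

The smaller the deviation, the more confidence the experiment gives that the system absorbs the injected failure.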

Highly Available & Redundant
Problem: Failure of a service in a specific location
Solution: Run across multiple availability zones or regions
Problem: Able to handle spikes of traffic
Solution: Have auto-scaling in place with EC2, Containers, or through leveraging serverless architectures.
Problem: Avoid Single Points of Failure (SPOF)
Solution: Be sure services are running in clusters scaled across AZs. Replication > Backups.

Using Standard Libraries & Design Patterns
Standardizing on libraries, languages, and style guides makes onboarding new
developers and troubleshooting issues easier. Enforce these programmatically
where you can (eslint, gofmt, etc.).
Spot situations where code is duplicated and can be refactored.
Look for opportunities to implement good design patterns.
Know your licenses: open source permissive (MIT/Apache) vs. copyleft
(GNU GPL/MPL)

Review for Security Best Practices
Security should always be a top priority
Ensure no credentials are being stored in the application
Code defensively for SQL injections, XSS attacks, and more
Leverage Static Analysis tools
https://en.wikipedia.org/wiki/List_of_tools_for_static_code_analysis
Consider using Pre-Commit by Yelp
http://pre-commit.com

Leverage other startups or rotate teams to keep fresh eyes on your code
Partner with another startup to help each other with architecture, code review,
interviewing, and more.
Consider rotating developers off of projects every few months to gain fresh
eyes on projects.

Monitoring
Application vs Service Level Alerting
[Diagram: application-level vs. service-level alerting across the Web, App, and DB tiers.]

Monitoring
Performance Metrics
Start by building a dashboard of “important” metrics. Continue iterating on this
as you learn more about your system under inspection. Each system has a
“heartbeat” that will appear off when things are unhealthy.
You always think you have enough metrics being gathered until you need the
one you’re missing. When applications fail, the more data you can observe the
easier it is to get to the root cause.
Averages hide issues. Be sure to leverage percentiles to expose where users
are experiencing issues.
Complicated systems build complicated dependency chains. Small fluctuations
in one part of your stack can manifest themselves in other parts.

Monitoring
Application Level Visibility
Provides Insight To Application Performance
You need visibility into how your application itself is performing.
How long are certain calls to resources taking?
Is that trending up or down?
What part of the application is generating the most errors?

Monitoring
Averages vs Percentiles
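A small sketch of why averages hide issues. The latency samples are invented, and `percentile` uses the nearest-rank method, which is only one of several common percentile definitions.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 95 fast requests and 5 very slow ones.
latencies_ms = [50] * 95 + [2000] * 5

print("mean:", mean(latencies_ms))            # 147.5
print("p50: ", percentile(latencies_ms, 50))  # 50
print("p99: ", percentile(latencies_ms, 99))  # 2000
```

The mean suggests a mildly slow service, while the 99th percentile shows that some users are waiting two seconds.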

Monitoring
Real User Monitoring (RUM) & Synthetic Monitoring
Synthetic Monitoring
Automatic testing of your site and service to measure performance.
Real User Monitoring
Shows you exactly how users are interacting with your site or application.
Measures page load times, DNS resolution issues, traffic bottlenecks, and
more.

Monitoring
Circuit Breakers
[Diagram: an APIClient calls the Orders, Invoices, and Customers services. Call latencies: 81ms, 63ms, 37ms (181ms total).]

Monitoring
Circuit Breakers
[Diagram: an APIClient calls the Orders, Invoices, and Customers services. Call latencies: 81ms, 63ms, 4082ms (4226ms total). One service is slow at handling requests, and requests are queuing up.]

Monitoring
Circuit Breakers
[Diagram: an APIClient calls the Orders, Invoices, and Customers services. Call latencies: 81ms, 63ms, 1ms (145ms total). Annotation: High Error Rate.]

Monitoring
Circuit Breakers
[Diagram: an APIClient calls the Orders, Invoices, and Customers services. Call latencies: 81ms, 63ms, 91ms (235ms total). Annotation: Reduced Error Rate.]

Monitoring
Circuit Breakers
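[Diagram: circuit breaker state machine with three states. Closed: requests succeed; a failure opens the circuit. Open: fast failing. Half Open: try one request; success closes the circuit, failure opens it again.]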
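A minimal sketch of the closed / open / half-open cycle in plain Python. This is not Hystrix; the class name, the failure threshold, the reset timeout, and the injectable `clock` are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is failed fast."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            self.state = "half_open"  # timeout elapsed: try one request
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed trial request, or too many failures, opens the circuit.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "closed"
```

Wrapping a slow or failing dependency this way lets callers fail in about a millisecond instead of queuing behind multi-second timeouts.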

Logging
Consistent Log Format
Consider using JSON for logging
Use log levels correctly [INFO/WARN/CRIT]
Add context for your logging statements
Log behaviors and errors
Consider how analytics will be used on this data
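One possible way to emit JSON logs with context, sketched with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the `context` key are illustrative names, not an established convention. Python spells the levels WARNING and CRITICAL rather than WARN and CRIT.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra context passed via `extra={"context": {...}}`, if any.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name, stream=None):
    handler = logging.StreamHandler(stream)  # defaults to stderr
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


logger = build_logger("checkout")
logger.warning("payment retry",
               extra={"context": {"order_id": "A-123", "attempt": 2}})
```

Structured fields such as `order_id` can then be filtered and aggregated by a log analytics tool without parsing free text.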

Logging
UTC Timestamps
Centrally aggregated logs make analysis easier
Helps prevent mismatch errors due to DST
Prepares you for multi-region
Log tool interfaces let you adjust time zones per user
[2017-07-13 14:49:24.436245]
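A short sketch of producing a UTC timestamp in the bracketed format shown above. The helper name and the sample local time (a hypothetical UTC-4 clock) are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def utc_timestamp(moment=None):
    """Format a timezone-aware datetime (default: now) as a UTC log timestamp."""
    moment = moment or datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("[%Y-%m-%d %H:%M:%S.%f]")


# A hypothetical local time on a server running at UTC-4.
local = datetime(2017, 7, 13, 10, 49, 24, 436245,
                 tzinfo=timezone(timedelta(hours=-4)))
print(utc_timestamp(local))  # [2017-07-13 14:49:24.436245]
```

Converting at write time means logs from different regions sort correctly once they are aggregated.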

Logging
Individual Transaction IDs
The session ID that generated the error
The user who encountered the error
The user’s location in the application
The ID of the transaction or product that caused the error
Be careful about what you log from a security perspective
[Diagram: the same transaction ID (ID 10948281) appears in both the web app logs and the database logs.]
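A minimal sketch of carrying one transaction ID through several components so their log lines can be joined later. All function names and the log line layout are hypothetical.

```python
import uuid


def new_transaction_id():
    """Generate a random ID to tag every log line of one transaction."""
    return uuid.uuid4().hex


def log_line(component, transaction_id, message):
    return f"component={component} txn_id={transaction_id} msg={message!r}"


def query_database(transaction_id):
    # The ID is passed along explicitly so the database layer logs it too.
    return log_line("db", transaction_id, "order row updated")


def handle_request(user_id):
    transaction_id = new_transaction_id()
    lines = [log_line("web", transaction_id, f"checkout for user {user_id}")]
    lines.append(query_database(transaction_id))
    return transaction_id, lines


txn, lines = handle_request(user_id=42)
for line in lines:
    print(line)
```

Searching the aggregated logs for one `txn_id` then returns the full path of a request. Only identifiers are logged here; credentials and payment details should stay out of log lines.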

Documentation
Store Your Documentation Close To Your Code: README
What the code does
How to install and run it
How to interact with it (stop, start, restart)
How to configure it
How to troubleshoot it
What metrics and dashboards are available

Alerting
"Level 1" Operations Teams Should Be Automated
check process nginx with pidfile /var/run/nginx.pid
  start program = "/etc/init.d/nginx start"
  stop program = "/etc/init.d/nginx stop"
  group www    # (for CentOS)

Alerting
EC2 Auto Recovery

Alerting
EC2 Auto Scaling

Alerting
Build Proper Escalation Paths For Alerts
Primary -> (10 minutes) -> Secondary -> (10 minutes) -> Team -> (10 minutes) -> Management
Being paged when something fails is great, but you
always need a backup
These need to auto escalate when not acknowledged
As it escalates up it’s good to notify a wider range of
people to get more eyes on the issue
Review alerts that have been ack’d or silenced beyond
a tolerable threshold.
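The escalation path can be sketched as a small lookup: every 10 minutes without an acknowledgement adds the next level to the set of people notified. The names and the cumulative-notification behavior are assumptions for illustration, not the API of any particular paging tool.

```python
ESCALATION_CHAIN = ["primary", "secondary", "team", "management"]
ESCALATION_INTERVAL_MINUTES = 10


def who_to_page(minutes_unacknowledged):
    """Return everyone who should have been notified by now.

    The list grows as the alert escalates, so more eyes land on the issue.
    """
    level = int(minutes_unacknowledged // ESCALATION_INTERVAL_MINUTES)
    level = min(level, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: level + 1]


print(who_to_page(0))    # ['primary']
print(who_to_page(25))   # ['primary', 'secondary', 'team']
```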

Alerting
Developers’ Code Should Only Burden Themselves
Example: bad application code causes a 40% increase in CPU usage across a cluster.
Temporary fix: Operations adds capacity.
Permanent fix: Developer deploys a hotfix.

Service Level Agreements/Objectives
Services Should Have An SLA/SLO
/Search: 99.99%
/Cart: 99.999%
/Avatars: 99.9%
These are internal SLAs for the
company
Helps identify how much effort should
be put into the reliability of each
service
Important when using microservices
for teams to reliably build
dependencies on your service.
https://landing.google.com/sre/book/chapters/service-level-objectives.html

Service Level Agreements
Understand The Cost Of Adding Each 9
Level of Availability   Percent of Uptime   Downtime per Year   Downtime per Day
1 Nine                  90%                 36.5 days           2.4 hours
2 Nines                 99%                 3.65 days           14 minutes
3 Nines                 99.9%               8.76 hours          86 seconds
4 Nines                 99.99%              52.6 minutes        8.6 seconds
5 Nines                 99.999%             5.25 minutes        0.86 seconds
6 Nines                 99.9999%            31.5 seconds        86 milliseconds
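The arithmetic behind these downtime figures can be sketched as follows, assuming a 365-day year and a 24-hour day. The function name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def allowed_downtime_seconds(nines):
    """Return (seconds per year, seconds per day) of allowed downtime
    for an availability target with the given number of nines."""
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    per_day = SECONDS_PER_DAY * unavailability
    return per_day * DAYS_PER_YEAR, per_day


per_year, per_day = allowed_downtime_seconds(3)
print(f"99.9%: {per_year / 3600:.2f} hours/year, {per_day:.1f} seconds/day")
```

Each added nine divides the allowed downtime by ten, which is why every extra nine costs noticeably more engineering effort than the last.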

Expected Throughput
Run Load Tests & Understand Your Limits
Before a service goes live, know where your breaking points are.
Know the bare minimum number of instances needed to run your average
throughput
Know the maximum throughput you can handle with your current architecture
Calculate the throughput-per-instance ratio so you can accurately set up
proper auto-scaling in a cost-optimized way.
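A rough sketch of turning load-test results into instance counts. The load-test numbers, the 25% headroom, and both function names are assumptions; real sizing should come from your own measurements.

```python
import math


def throughput_per_instance(max_rps, instance_count):
    """Requests per second one instance sustained during the load test."""
    return max_rps / instance_count


def instances_needed(target_rps, rps_per_instance, headroom=0.25):
    """Instances required for target_rps, with spare capacity for spikes."""
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    return math.ceil(target_rps * (1 + headroom) / rps_per_instance)


# Hypothetical load test: 6 instances sustained 3000 RPS before breaking.
per_instance = throughput_per_instance(3000, 6)   # 500.0 RPS per instance
print(instances_needed(1200, per_instance))       # minimum for average load: 3
print(instances_needed(2800, per_instance))       # ceiling for peak load: 7
```

The two results map naturally onto the minimum and maximum sizes of an auto-scaling group.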

Expected Throughput
Helps with Cost Optimization & Auto Scaling

Expected Throughput
Provides Performance Baseline For Future Release
[Chart: Max RPS for V1 vs. V14, on an axis from 0 to 3500.]
As code evolves, so does your
performance.
Understand the impact of additional
libraries, added lines of code, and new
external calls.
Here we see a 63.58% increase in
performance from V1 to V14. This
directly correlates to your infrastructure
cost.

Testing
Adopt Automated Testing Early
Builds confidence in the code being
released
Allows you to test more of your
application in less time
Manual testing can become error
prone

Testing
Test Driven Development
Red: Build a test first; it fails.
Green: Develop code so it passes.
Refactor: Refactor and optimize the code.
Repeat.

Deployment Strategy
Database Migrations
Understand what changes to the database need to happen to support new
code releases.
Avoid removing columns, only make additions to reduce risk.
Be sure to test migrations against test copies of the database
Keep a revision history of database migrations for reference
Snapshot databases before doing migrations

Deployment Strategy
Canary Pools
[Diagram: two panels, each with a load balancer in front of Version 1 and Version 2. Traffic split: 10% / 90% in the first panel, 100% / 0% in the second. Labels: 0% Errors, 0% Errors.]
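A minimal sketch of a canary rollout decision, assuming Version 2 is the canary that starts at 10% of traffic. The function names, the 1% error threshold, and the all-or-nothing promotion step are illustrative choices, not the behavior of any specific load balancer.

```python
import random


def pick_version(canary_weight, rng=random.random):
    """Route one request: the canary (v2) receives `canary_weight` of traffic."""
    return "v2" if rng() < canary_weight else "v1"


def next_canary_weight(canary_error_rate, max_error_rate=0.01):
    """Promote the canary to 100% if it is healthy, otherwise roll back to 0%."""
    return 0.0 if canary_error_rate > max_error_rate else 1.0


# Simulate 10,000 requests with a 10% canary pool.
rng = random.Random(0)
hits = sum(pick_version(0.10, rng.random) == "v2" for _ in range(10_000))
print(f"canary share: {hits / 10_000:.1%}")
print("next weight:", next_canary_weight(canary_error_rate=0.0))
```

In practice many teams step through intermediate weights rather than jumping from 10% straight to 100%.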

Deployment Strategy
Dark Deploys & Feature Flags
Opt In: Test new features with selected users
Kill Switch: Disable poorly performing features
Scalable Roll Outs: Do % roll outs of new features
Block Users: Prevent selected users from features
Run A/B Tests: Test and compare new features
Sunset Old Features: Safely decommission old features
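Several of these capabilities fit in one small evaluation function. This is a sketch only; the flag dictionary layout and its key names (`kill_switch`, `blocked_users`, `opt_in_users`, `rollout_percent`) are hypothetical and do not come from any feature flag product.

```python
import hashlib


def bucket(user_id, flag_name):
    """Map a user to a stable bucket in [0, 100) for percentage rollouts."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag, user_id):
    if flag.get("kill_switch"):
        return False                       # feature disabled for everyone
    if user_id in flag.get("blocked_users", ()):
        return False                       # selected users never see it
    if user_id in flag.get("opt_in_users", ()):
        return True                        # selected users always see it
    return bucket(user_id, flag["name"]) < flag.get("rollout_percent", 0)


new_search = {
    "name": "new-search",
    "rollout_percent": 20,
    "opt_in_users": {"alice"},
    "blocked_users": {"mallory"},
}
print(is_enabled(new_search, "alice"))    # True
print(is_enabled(new_search, "mallory"))  # False
```

Hashing the user ID keeps each user in the same bucket between requests, so raising `rollout_percent` only ever adds users to the feature.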

Error Budget
Spend it! It’s there for you to use.
Error budget is there for you to take calculated risks in your environment.
Allows you to save up a high budget to spend it on major architectural
changes.
Some companies force the spending of this budget when it’s not utilized to
encourage services built on it to gracefully fail. If the SLA is 99.99% and it’s
running at 100%, they will manually force downtime to stay at 99.99%.
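A small sketch of the error budget arithmetic for the 99.99% example. The 30-day window and the 1.5 minutes of observed downtime are assumed values for illustration.

```python
def error_budget_minutes(slo, period_minutes):
    """Minutes of downtime the SLO allows over the period."""
    return (1 - slo) * period_minutes


def budget_remaining(slo, period_minutes, downtime_minutes):
    """Budget left to spend; a negative value means the SLO was missed."""
    return error_budget_minutes(slo, period_minutes) - downtime_minutes


THIRTY_DAYS = 30 * 24 * 60  # minutes in an assumed 30-day window

print(f"budget:    {error_budget_minutes(0.9999, THIRTY_DAYS):.2f} minutes")
print(f"remaining: {budget_remaining(0.9999, THIRTY_DAYS, 1.5):.2f} minutes")
```

A service with budget left at the end of the window has room for a risky deploy or a deliberate failure test.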

Summary of key areas for a PRR
Monitoring
Logging
Documentation
Alerting
Service Level Agreement
Expected Throughput
Testing
Deploy Strategy

Resources
Useful resources related to the topics covered
Production Readiness Review:
https://arxiv.org/pdf/1305.2402.pdf
Netflix Hystrix Circuit Breaker:
https://github.com/Netflix/Hystrix/wiki/How-it-Works
Feature Flags:
https://en.wikipedia.org/wiki/Feature_toggle
Error Budgets:
https://landing.google.com/sre/interview/ben-treynor.html
Monitoring Philosophies:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Chris Munns
munns@amazon.com
@chrismunns
https://www.flickr.com/photos/theredproject/3302110152/
