
Operations: Production Readiness Review – How to stop bad things from Happening



There is more to deploying code than pushing the deploy button. A good practice that many companies follow is a Production Readiness Review (PRR), which is essentially a pre-flight checklist before a service launches. This helps ensure new services are properly architected, monitored, secured, and more. We’ll walk through an example PRR and discuss the value of ensuring each of these is properly taken care of before your service launches.



  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chris Munns Fall 2017 AWS Startup Day Production Readiness Review
  2. 2. About me: Chris Munns -, @chrismunns • Senior Developer Advocate - Serverless • New Yorker • Previously: • AWS Business Development Manager – DevOps, July ’15 - Feb ‘17 • AWS Solutions Architect Nov, 2011- Dec 2014 • Formerly on operations teams @Etsy and @Meetup • Little time at a hedge fund, Xerox and a few other startups • Rochester Institute of Technology: Applied Networking and Systems Administration ’05 • Internet infrastructure geek
  3. 3. “Everything fails all the time.” Werner Vogels, CTO,
  4. 4. Production Readiness Review You don’t need all of these from day one; grow them as your teams grow. Architecture Design Review, Monitoring, Logging, Documentation, Alerting, Service Level Agreement, Expected Throughput, Testing, Deploy Strategy.
  5. 5. Architecture Design Review
  6. 6. Architecture Design Review Netflix Chaos Engineering 1. Define the system’s normal behavior — its “steady state” — based on measurable output like overall throughput, error rates, latency, etc. 2. Hypothesize about the steady state behavior of an experimental group, as compared to a stable control group. 3. Expose the experimental group to simulated real-world events such as server crashes, malformed responses, or traffic spikes. 4. Test the hypothesis by comparing the steady state of the control group and the experimental group. The smaller the differences, the more confidence we have that the system is resilient. TLDR; Intentionally break things, compare measured with expected impact, and correct any problems uncovered this way.
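The comparison in step 4 can be sketched in a few lines. This is only an illustration, not Netflix's tooling: the `GroupStats` type, the choice of error rate as the steady-state metric, and the 1% tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    """Request and error counts observed for one group (hypothetical type)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Guard against division by zero when a group saw no traffic.
        return self.errors / self.requests if self.requests else 0.0


def steady_state_holds(control: GroupStats, experiment: GroupStats,
                       tolerance: float = 0.01) -> bool:
    """True when the experimental group's error rate stays within
    `tolerance` (an assumed threshold) of the control group's."""
    return abs(experiment.error_rate - control.error_rate) <= tolerance


control = GroupStats(requests=10_000, errors=20)
resilient = GroupStats(requests=10_000, errors=45)
fragile = GroupStats(requests=10_000, errors=900)
```

Here `steady_state_holds(control, resilient)` passes while `steady_state_holds(control, fragile)` does not, which is the signal that the injected failure exposed a weakness.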
  7. 7. Architecture Design Review Highly Available & Redundant. Problem: failure of a service in a specific location. Solution: run across multiple availability zones or regions. Problem: need to handle spikes of traffic. Solution: have auto-scaling in place with EC2, containers, or through leveraging serverless architectures. Problem: avoid Single Points of Failure (SPOF). Solution: be sure services are running in clusters scaled across AZs. Replication > Backups.
  8. 8. Architecture Design Review Using Standard Libraries & Design Patterns. Standardizing on libraries, languages, and style guides makes onboarding new developers and troubleshooting issues easier. Enforce these programmatically where you can (eslint, gofmt, etc.). Spot situations where code is duplicated and can be refactored. Look for opportunities to implement good design patterns. Know your licenses: open source permissive (MIT/Apache) vs copyleft (GPL/MPL).
  9. 9. Architecture Design Review Review for Security Best Practices. Security should always be a top priority. Ensure no credentials are stored in the application. Code defensively against SQL injection, XSS attacks, and more. Leverage static analysis tools. Consider using pre-commit by Yelp.
  10. 10. Architecture Design Review Leverage other startups or rotate teams to keep fresh eyes on your code. Partner with another startup to help each other with architecture, code review, interviewing, and more. Consider rotating developers off of projects every few months to gain fresh eyes on projects.
  11. 11. Monitoring
  12. 12. Monitoring Application vs Service Level Alerting [Diagram: the Web, App, and DB tiers shown twice, once under Application Level and once under Service Level.]
  13. 13. Monitoring Performance Metrics Start by building a dashboard of “important” metrics. Continue iterating on this as you learn more about your system under inspection. Each system has a “heartbeat” that will appear off when things are unhealthy. You always think you have enough metrics being gathered until you need the one you’re missing. When applications fail, the more data you can observe the easier it is to get to the root cause. Averages hide issues. Be sure to leverage percentiles to expose where users are experiencing issues. Complicated systems build complicated dependency chains. Small fluctuations in one part of your stack can manifest themselves in other parts.
  14. 14. Monitoring Application Level Visibility Provides Insight To Application Performance. You need visibility into how your application itself is performing. How long are certain calls to resources taking? Is that trending up or down? What part of the application is generating the most errors?
  15. 15. Monitoring Averages vs Percentiles
  16. 16. Monitoring Averages vs Percentiles
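A small worked example of why averages hide issues. The latency samples are invented for illustration, and the nearest-rank `percentile` helper is one common definition among several; monitoring tools may interpolate differently.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical data: 95 fast requests and 5 very slow ones.
latencies_ms = [40] * 95 + [2000] * 5

mean_ms = statistics.mean(latencies_ms)  # 138
p50_ms = percentile(latencies_ms, 50)    # 40
p95_ms = percentile(latencies_ms, 95)    # 40
p99_ms = percentile(latencies_ms, 99)    # 2000
```

The mean of 138ms describes no real request: most users see 40ms, while the p99 shows that 1 in 20 users in this sample waits two full seconds.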
  17. 17. Monitoring Real User Monitoring (RUM) & Synthetic Monitoring. Synthetic Monitoring: automatic testing of your site and service to measure performance. Real User Monitoring: shows you exactly how users are interacting with your site or application. Measures page load times, DNS resolution issues, traffic bottlenecks, and more.
  18. 18. Monitoring Circuit Breakers [Diagram: Client, API, and the Customers, Orders, and Invoices services. Call latencies: 81ms + 63ms + 37ms = 181ms total.]
  19. 19. Monitoring Circuit Breakers [Diagram: same services. Call latencies: 81ms + 63ms + 4082ms = 4226ms total. One service is slow at handling requests, and requests are queuing up.]
  20. 20. Monitoring Circuit Breakers [Diagram: same services, now with a high error rate. Call latencies: 81ms + 63ms + 1ms = 145ms total.]
  21. 21. Monitoring Circuit Breakers [Diagram: same services, now with a reduced error rate. Call latencies: 81ms + 63ms + 91ms = 235ms total.]
  22. 22. Monitoring Circuit Breakers [Diagram: same services, back to normal. Call latencies: 81ms + 63ms + 37ms = 181ms total.]
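  23. 23. Monitoring Circuit Breakers [Diagram: state machine with three states: Closed, Open, and Half Open. Transition labels as they appear on the slide: Success; Fast Failing; Open; Try One Request; Fail, Open Circuit; Success, Open Circuit.]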
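The Closed / Open / Half Open cycle can be sketched as a small class. This is a minimal illustration, not the Hystrix implementation: the class name, thresholds, and injectable clock are assumptions, and it follows the conventional behavior where a successful trial request in the half-open state closes the circuit again.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker fails fast instead of calling the service."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self._clock = clock                         # injectable for tests
        self._failures = 0
        self._opened_at = None
        self.state = "closed"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: fail fast without touching the struggling service.
                raise CircuitOpenError("circuit is open, failing fast")
            # Timeout elapsed: let exactly this one request through.
            self.state = "half_open"
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._record_failure()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self.state = "closed"
        return result

    def _record_failure(self):
        self._failures += 1
        # A failed trial request, or too many failures, opens the circuit.
        if self.state == "half_open" or self._failures >= self.failure_threshold:
            self.state = "open"
            self._opened_at = self._clock()
```

Wrapping each downstream call (for example `breaker.call(fetch_invoices, customer_id)`, where `fetch_invoices` is a hypothetical function) keeps one slow dependency from queuing up requests across the whole request path.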
  24. 24. Logging
  25. 25. Logging Consistent Log Format. Consider using JSON for logging. Use log levels correctly [INFO/WARN/CRIT]. Add context to your logging statements. Log behaviors and errors. Consider how analytics will be used on this data.
  26. 26. Logging UTC Timestamps. Centrally aggregated logs make analysis easier. Helps prevent mismatch errors due to DST. Prepares you for multi-region. Log tool interfaces let you adjust time zones per user. Example: [2017-07-13 14:49:24.436245]
  27. 27. Logging Individual Transaction IDs. Log the session ID that generated the error, the user who encountered the error, the user’s location in the application, and the ID of the transaction or product that caused the error. Be careful about what you log from a security perspective. [Diagram: the same ID 10948281 appears in both the Web App and the Database.]
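The three logging slides (JSON format, UTC timestamps, transaction IDs) can be combined in one formatter for Python's standard `logging` module. This is a sketch under assumptions: the `JsonUtcFormatter` name, the `context` key, and the field names inside it are invented for the example. Note that Python spells the levels `WARNING` and `CRITICAL` rather than `WARN` and `CRIT`.

```python
import json
import logging
from datetime import datetime, timezone


class JsonUtcFormatter(logging.Formatter):
    """Render each record as one JSON object with a UTC timestamp."""

    def format(self, record):
        payload = {
            # record.created is seconds since the epoch; force UTC output.
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional per-request context passed via `extra={"context": {...}}`.
        context = getattr(record, "context", None)
        if context:
            payload["context"] = context
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonUtcFormatter())
logger = logging.getLogger("checkout")  # hypothetical logger name
logger.addHandler(handler)
logger.setLevel(logging.INFO)
```

A call such as `logger.error("charge failed", extra={"context": {"session_id": "abc123", "transaction_id": "10948281"}})` then emits a single JSON line that can be searched by transaction ID across services. Keep secrets and personal data out of the context dictionary.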
  28. 28. Documentation Store Your Documentation Close To Your Code: what the code does; how to install and run it; how to interact with it (stop, start, restart); how to configure it; how to troubleshoot it; what metrics and dashboards are available.
  29. 29. Alerting
  30. 30. Alerting "Level 1" Operations Teams Should Be Automated. check process nginx with pidfile /var/run/ start program = "/etc/init.d/nginx start" stop program = "/etc/init.d/nginx stop" group www (for centos)
  31. 31. Alerting "Level 1" Operations Teams Should Be Automated EC2 Auto Recovery
  32. 32. Alerting "Level 1" Operations Teams Should Be Automated EC2 Auto Scaling
  33. 33. Alerting Build Proper Escalation Paths For Alerts. Primary Secondary Team Management 10 Minutes 10 Minutes 10 Minutes. Being paged when something fails is great, but you always need a backup. These need to auto escalate when not acknowledged. As it escalates up it’s good to notify a wider range of people to get more eyes on the issue. Review alerts that have been ack’d or silenced beyond a tolerable threshold.
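The auto-escalation rule can be sketched as a pure function of how long an alert has gone unacknowledged. The four-tier reading of the slide (primary, secondary, team, management) and the idea that each 10-minute step widens the audience are assumptions; real paging tools express this as configuration rather than code.

```python
# Assumed tiers, in escalation order, and the assumed delay between steps.
ESCALATION_TIERS = ["primary", "secondary", "team", "management"]
STEP_MINUTES = 10


def notified_tiers(minutes_unacknowledged, tiers=ESCALATION_TIERS,
                   step=STEP_MINUTES):
    """Return every tier that should have been paged so far.

    Escalation widens the audience rather than replacing it, so earlier
    tiers stay in the list.
    """
    if minutes_unacknowledged < 0:
        raise ValueError("minutes_unacknowledged must be non-negative")
    reached = min(int(minutes_unacknowledged // step), len(tiers) - 1)
    return tiers[: reached + 1]
```

For example, `notified_tiers(25)` returns the primary, secondary, and team tiers, and any value of 30 or more reaches management.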
  34. 34. Alerting Developers’ Code Should Only Burden Themselves. Bad application code causes a 40% increase in CPU usage across a cluster. Temporary fix: Operations adds capacity. Permanent fix: the developer deploys a hotfix.
  35. 35. Service Level Agreements
  36. 36. Service Level Agreements/Objectives Services Should Have An SLA/SLO. /Search: 99.99%, /Cart: 99.999%, /Avatars: 99.9%. These are internal SLAs for the company. Helps identify how much effort should be put into the reliability of each service. Important when using microservices, so teams can reliably build dependencies on your service.
  37. 37. Service Level Agreements Understand The Cost Of Adding Each 9. Level of availability (percent of uptime): downtime per year, downtime per day. 1 nine (90%): 36.5 days, 2.4 hours. 2 nines (99%): 3.65 days, 14 minutes. 3 nines (99.9%): 8.76 hours, 86 seconds. 4 nines (99.99%): 52.6 minutes, 8.6 seconds. 5 nines (99.999%): 5.25 minutes, 0.86 seconds. 6 nines (99.9999%): 31.5 seconds, 86 milliseconds.
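The downtime figures follow from one multiplication: allowed downtime equals the length of the period times the fraction of time the service may be unavailable. A short sketch (the function name is made up for the example, and a 365-day year is assumed):

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumes a 365-day year


def allowed_downtime_seconds(availability_percent, period_seconds):
    """Downtime permitted in a period at the given availability."""
    return period_seconds * (1 - availability_percent / 100)


# Four nines: about 52.6 minutes per year and 8.6 seconds per day.
four_nines_year_min = allowed_downtime_seconds(99.99, SECONDS_PER_YEAR) / 60
four_nines_day_s = allowed_downtime_seconds(99.99, SECONDS_PER_DAY)

# Six nines: about 31.5 seconds per year and 86 milliseconds per day.
six_nines_year_s = allowed_downtime_seconds(99.9999, SECONDS_PER_YEAR)
six_nines_day_ms = allowed_downtime_seconds(99.9999, SECONDS_PER_DAY) * 1000
```

Each added nine divides the allowance by ten, which is why each step up in availability costs more than the one before it.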
  38. 38. Expected Throughput Run Load Tests & Understand Your Limits. Before a service goes live, know where your breaking points are. Know the bare minimum number of instances needed to run your average throughput. Know the maximum throughput you can handle with your current architecture. Calculate the throughput-per-instance ratio so you can accurately set up auto-scaling in a cost-optimized way.
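Once a load test gives a requests-per-second figure per instance, sizing becomes simple arithmetic. In this sketch the 20% headroom and the floor of two instances (to avoid a single point of failure) are assumptions, not numbers from the talk.

```python
import math


def instances_needed(target_rps, rps_per_instance, headroom=0.2, minimum=2):
    """Instances required to serve target_rps.

    headroom reserves a fraction of each instance's measured capacity so
    the fleet is not run at its breaking point; minimum keeps at least
    two instances for redundancy. Both defaults are assumed values.
    """
    if rps_per_instance <= 0:
        raise ValueError("rps_per_instance must be positive")
    usable_rps = rps_per_instance * (1 - headroom)
    return max(minimum, math.ceil(target_rps / usable_rps))
```

With a measured 500 RPS per instance, `instances_needed(3000, 500)` gives 8, and the same function evaluated at average and peak load gives the lower and upper bounds for an auto-scaling group.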
  39. 39. Expected Throughput Helps with Cost Optimization & Auto Scaling
  40. 40. Expected Throughput Provides Performance Baseline For Future Release. [Chart: Max RPS for V1 vs V14, on an axis from 0 to 3500.] As code evolves, so does your performance. Understand the impact of additional libraries, added lines of code, and new external calls. Here we see a 63.58% increase in performance from V1 to V14. This directly correlates to your infrastructure cost.
  41. 41. Testing
  42. 42. Testing Adopt Automated Testing Early. Builds confidence in the code being released. Allows you to test more of your application in less time. Manual testing can become error prone.
  43. 43. Testing Test Driven Development: Red, Green, Refactor. Red: write a test first and watch it fail. Green: develop code so the test passes. Refactor: clean up and optimize the code. Repeat.
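One turn of the red, green, refactor loop, shown at the green stage. The `cart_total` function and its behavior are invented for illustration; in practice the test class is written first and fails until the function exists and behaves as asserted.

```python
import unittest


def cart_total(prices_cents, discount_percent=0):
    """Sum item prices (in cents) and apply a whole-percent discount."""
    subtotal = sum(prices_cents)
    return subtotal * (100 - discount_percent) // 100


class CartTotalTest(unittest.TestCase):
    # Red: these tests were the starting point and failed at first.
    def test_sums_prices(self):
        self.assertEqual(cart_total([1000, 500]), 1500)

    def test_applies_discount(self):
        self.assertEqual(cart_total([1000, 500], discount_percent=10), 1350)

    def test_empty_cart_is_free(self):
        self.assertEqual(cart_total([]), 0)
```

Running `python -m unittest` against the file executes the suite; with the tests green, the function body can be refactored freely as long as they stay green.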
  44. 44. Deployment Strategy
  45. 45. Deployment Strategy Database Migrations. Understand what changes to the database need to happen to support new code releases. Avoid removing columns; only make additions to reduce risk. Be sure to test migrations against test copies of the database. Keep a revision history of database migrations for reference. Snapshot databases before doing migrations.
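An additive migration with a revision history can be sketched with Python's built-in `sqlite3` module. The table names, the `schema_migrations` history table, and the migration list are all hypothetical; real projects usually rely on a migration tool, and snapshots are taken with the database's own backup mechanism before running anything like this.

```python
import sqlite3

# Ordered, additive-only migrations (hypothetical schema).
MIGRATIONS = [
    (1, "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    (2, "ALTER TABLE users ADD COLUMN display_name TEXT"),
]


def apply_migrations(conn, migrations=MIGRATIONS):
    """Apply migrations not yet recorded in the history table.

    Returns the list of versions applied during this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations ("
        "version INTEGER PRIMARY KEY, "
        "applied_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    applied = {row[0] for row in
               conn.execute("SELECT version FROM schema_migrations")}
    newly_applied = []
    for version, sql in sorted(migrations):
        if version in applied:
            continue  # already recorded, so skip it
        with conn:  # commits the history row when the block exits cleanly
            conn.execute(sql)
            conn.execute(
                "INSERT INTO schema_migrations (version) VALUES (?)",
                (version,),
            )
        newly_applied.append(version)
    return newly_applied
```

Because version 2 only adds a nullable column, code from the previous release keeps working against the new schema, which is the reason for preferring additions over removals.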
  46. 46. Deployment Strategy Canary Pools. [Diagram: a load balancer splits traffic between Version 1 and Version 2, first 10%/90%, then 100%/0%, with 0% errors shown at each stage.]
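A canary rollout has two parts: weighted routing and a rule for moving the weight. The sketch below is illustrative only; in practice the split lives in the load balancer, and the version names, 1% error ceiling, and 10-point step are assumptions.

```python
import random


def pick_version(canary_weight, rng=random):
    """Route one request: 'v2' (the canary) with probability canary_weight,
    otherwise 'v1'."""
    return "v2" if rng.random() < canary_weight else "v1"


def next_canary_weight(current_weight, canary_error_rate,
                       max_error_rate=0.01, step=0.10):
    """Decide the canary's next traffic share.

    Roll back to 0 if the canary's error rate exceeds the ceiling,
    otherwise shift another `step` of traffic to it, capped at 100%.
    """
    if canary_error_rate > max_error_rate:
        return 0.0
    return min(1.0, current_weight + step)
```

Starting at `canary_weight=0.10` and calling `next_canary_weight` after each observation window walks the new version up to all traffic, or drops it to zero the first time errors exceed the ceiling.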
  47. 47. Deployment Strategy Dark Deploys & Feature Flags. Opt In: test new features with selected users. Kill Switch: disable poorly performing features. Scalable Roll Outs: do % roll outs of new features. Block Users: prevent selected users from features. Run A/B Tests: test and compare new features. Sunset Old Features: safely decommission old features.
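Several of these flag behaviors (opt in, kill switch, percentage rollout, blocked users) fit in one small class. This is a sketch, not any particular feature-flag product: the `FeatureFlag` class and its fields are invented, and hashing the flag name with the user ID is one common way to give each user a stable bucket.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = True       # kill switch: False turns it off for everyone
    rollout_percent: int = 0   # share of users (0-100) who get the feature
    opted_in: set = field(default_factory=set)  # always on for these users
    blocked: set = field(default_factory=set)   # never on for these users

    def _bucket(self, user_id):
        # Stable bucket in [0, 100): the same user always lands in the same
        # bucket for a given flag, so their experience does not flicker.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id):
        if not self.enabled or user_id in self.blocked:
            return False
        if user_id in self.opted_in:
            return True
        return self._bucket(user_id) < self.rollout_percent
```

Raising `rollout_percent` from 0 toward 100 performs the percentage rollout, and setting `enabled = False` acts as the kill switch without a deploy.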
  48. 48. Error Budget Spend it! It’s there for you to use. Error budget is there for you to take calculated risks in your environment. Allows you to save up a high budget to spend it on major architectural changes. Some companies force the spending of this budget when it’s not utilized to encourage services built on it to gracefully fail. If the SLA is 99.99% and it’s running at 100%, they will manually force downtime to stay at 99.99%.
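An error budget is the downtime an SLA leaves room for, minus what has already been used. The sketch below assumes a 30-day window and invents the function names; the "can we take this risk" check is an illustration of spending the budget deliberately, not a policy from the talk.

```python
def error_budget_minutes(sla_percent, period_days=30):
    """Total downtime (in minutes) the SLA allows over the period."""
    return period_days * 24 * 60 * (1 - sla_percent / 100)


def budget_remaining_minutes(sla_percent, downtime_minutes, period_days=30):
    """Budget left after subtracting downtime already incurred."""
    return error_budget_minutes(sla_percent, period_days) - downtime_minutes


def can_afford(sla_percent, downtime_minutes, expected_risk_minutes,
               period_days=30):
    """True if a change expected to cost up to expected_risk_minutes of
    downtime still fits inside the remaining budget."""
    remaining = budget_remaining_minutes(sla_percent, downtime_minutes,
                                         period_days)
    return expected_risk_minutes <= remaining
```

At 99.99% over 30 days the budget is about 4.32 minutes, so a service that has been down for one minute this period still has roughly 3.32 minutes to spend on a risky change.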
  49. 49. Production Readiness Review Summary of key areas for a PRR: Architecture Design Review, Monitoring, Logging, Documentation, Alerting, Service Level Agreement, Expected Throughput, Testing, Deploy Strategy.
  50. 50. Resources Useful resources related to the topics covered Production Readiness Review: Netflix Hystrix Circuit Breaker: Feature Flags: Error Budgets: Monitoring Philosophies:
  51. 51. Chris Munns @chrismunns