Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small (ARC337) - AWS re:Invent 2018

Whether it’s distributing configurations and customer settings, launching instances, or responding to surges in load, having a great control plane is key to the success of any system or service. Come hear about the techniques we use to build stable and scalable control planes at Amazon. We dive deep into the designs that power the most reliable systems at AWS. We share hard-earned operational lessons and explain academic control theory in easy-to-apply patterns and principles that are immediately useful in your own designs.

  1. Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small (ARC337). Colm MacCárthaigh, Senior Principal Engineer, AWS. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
  7. “Quality is not an act, it is a habit.” Aristotle, some time around 350 BC
  8. Amazon CloudFront Control Plane
  9. CloudFront Control Plane
  10. CloudFront Control Plane: (-, +), (-, -), (+, -), (+, +)
  11. What goes into high-quality designs
       • Diverse creative minds working in a fearless environment
       • Systematic reviews and mechanisms to share lessons
       • Use well-worn patterns where possible and focus invention where it is truly needed
       • Testing, testing, testing, testing, testing
  12. How we make trade-offs in design
  13. Control Planes vs. Data Planes
       • Control planes are often a bigger design challenge than the data planes that they support
       • Poorly designed control planes can cause large outages or, worse, misconfigurations and corruption
  14. What do Control Planes do in the Cloud?
       • Manage the life cycle for resources
       • Provision software
       • Provision service configuration
       • Provision user configuration
  17. Control Theory 101
       • Independently discovered in several fields of engineering and science
       • Formalized in the early-to-mid twentieth century
       • One of the most under-appreciated branches of science, and incredibly relevant to distributed systems
  23. Control Theory 101: PID
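
The slide names PID (proportional-integral-derivative) control without spelling it out. The following is a minimal sketch of a discrete-time PID loop, not code from the talk: the class name `PID`, the gains, and the toy plant in `simulate` are all illustrative assumptions.

```python
class PID:
    """Discrete-time PID controller (illustrative sketch, hypothetical gains)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured, dt):
        # Error is the gap between the desired and the observed state.
        error = self.setpoint - measured
        self.integral += error * dt
        # No derivative term on the first sample, since there is no history yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps=600, dt=0.1):
    """Toy plant (assumed): the measured value moves by output * dt each step."""
    pid = PID(kp=1.0, ki=0.1, kd=0.05, setpoint=10.0)
    value = 0.0
    for _ in range(steps):
        value += pid.update(value, dt) * dt
    return value
```

The loop is closed because each call to `update` is fed the measured value rather than the value the controller expected to reach.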
  26. Pattern 1: Checksum all of the things
  27. Pattern 1: Checksum all the things. Example: watch: out: for: - YAML this: file: can: be: - truncated
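
A truncated YAML file can still parse as a valid, smaller document, so parsing alone will not catch the truncation. A rough sketch of guarding a configuration blob with a SHA-256 checksum follows. The helper names `seal` and `verify`, and the choice to store the digest as a leading line, are assumptions made for illustration.

```python
import hashlib


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its SHA-256 hex digest on its own line."""
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return digest + b"\n" + payload


def verify(blob: bytes) -> bytes:
    """Return the payload if the digest matches, otherwise raise ValueError."""
    digest, sep, payload = blob.partition(b"\n")
    if not sep or hashlib.sha256(payload).hexdigest().encode("ascii") != digest:
        raise ValueError("checksum mismatch: configuration is truncated or corrupted")
    return payload


# Hypothetical configuration modeled on the slide's YAML example.
config = (
    b"watch:\n  out:\n    for:\n      - YAML\n"
    b"this:\n  file:\n    can:\n      be:\n        - truncated\n"
)
sealed = seal(config)
```

A plain checksum only detects accidental damage. Deliberate tampering is the concern of Pattern 2.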
  28. Pattern 2: Cryptographic Authentication
       • Encrypt and authenticate everything!
       • Control planes are powerful, security-critical systems
       • Be able to revoke and rotate every credential, but also watch out for certificate expiries
       • Prevent human access to production credentials
       • Never allow a non-production control plane to talk to the production data plane
  29. Pattern 3: Cells, Shells, and Poison Tasters
       • We divide up our control planes horizontally into regions, availability zones, and cells
       • It’s also common to compartmentalize control planes so that the data plane is insulated from control plane crashes
       • Poison tasters: check up front that a change is safe
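
One way to read the poison taster idea is to try a change on a throwaway copy before touching live state. The sketch below assumes a toy configuration of non-negative integer weights. The function names and the validation rule are hypothetical and not taken from the talk.

```python
import copy


def apply_change(state: dict, change: dict) -> None:
    """Apply a change in place (assumed rule: weights are non-negative ints)."""
    for key, value in change.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"invalid weight for {key!r}: {value!r}")
        state[key] = value


def taste_then_apply(live_state: dict, change: dict) -> bool:
    """Try the change on a deep copy first; touch live state only on success."""
    trial = copy.deepcopy(live_state)
    try:
        apply_change(trial, change)
    except Exception:
        return False  # poisoned change rejected, live state untouched
    apply_change(live_state, change)
    return True
```

Because `apply_change` can fail halfway through, applying it directly could leave live state partially updated. Tasting on a copy avoids that.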
  30. Pattern 4: Asynchronous Coupling
       • Synchronous systems are very strongly coupled
       • A problem in a synchronous downstream dependency has immediate impact on the upstream callers
       • Retries from upstream callers can all too easily fan out and amplify problems
  31. Pattern 4: Asynchronous Coupling
       • Asynchronously coupled systems tend to be more tolerant
       • They can make partial progress even when some components are unavailable
       • Workflows and queues can be tuned to have deterministic retry behaviors
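
As a small illustration of deterministic retries in a queue-based workflow, the sketch below gives each task a fixed attempt budget and re-queues failures behind other work. The names `drain` and `MAX_ATTEMPTS` and the budget of three attempts are assumptions, not details from the talk.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget per task


def drain(tasks, handler):
    """Process queued tasks; a failing task is retried at most MAX_ATTEMPTS times."""
    queue = deque((task, 1) for task in tasks)
    done, dead = [], []
    while queue:
        task, attempt = queue.popleft()
        try:
            handler(task)
            done.append(task)
        except Exception:
            if attempt < MAX_ATTEMPTS:
                # Retry later, behind other work, so one bad task cannot block the rest.
                queue.append((task, attempt + 1))
            else:
                dead.append(task)  # give up after a known, bounded number of tries
    return done, dead
```

The total work is bounded by `len(tasks) * MAX_ATTEMPTS`, so a failing dependency cannot trigger an unbounded retry storm.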
  32. Pattern 5: Closed Feedback Loops
  33. Pattern 6: Small pushes and large pulls
       • Very Frequently Asked Question: Is it better to push, or to pull?
       • For example: should data plane hosts accept connections and be pushed configurations, or should they connect to the control plane and pull them?
       • It’s really the wrong question!
  34. Pattern 6: Small pushes and large pulls
       • Long-lived connections can support pushing timely updates regardless of the “direction” of the connection
       • Better to ask: which fleet is bigger?
       • In general, small fleets should connect to bigger fleets
       • This avoids the problems of small fleets being overwhelmed with thundering herds and retry storms
  35. Pattern 7: Avoiding Cold Starts and Cold Caches
       • Caches are bi-modal systems: super fast when they have entries, and slow when they are empty
       • A thundering herd hitting a cold cache can prevent it from ever getting warm
       • Retry storms often need to be moderated by throttles
  36. Pattern 7: Avoiding Cold Starts and Cold Caches
       • Work out if you really need a cache at all
       • Pre-warm caches before accepting requests
       • Consider serving stale entries when backends are unavailable
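
The last two suggestions can be sketched as a cache that is pre-warmed before use and falls back to stale entries when the backend fails. The class name `StaleOkCache`, the injected `clock`, and the TTL handling are illustrative assumptions.

```python
import time


class StaleOkCache:
    """Cache that can be pre-warmed and serves stale entries on backend failure."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch, self.ttl, self.clock = fetch, ttl_seconds, clock
        self.entries = {}  # key -> (value, fetched_at)

    def prewarm(self, keys):
        """Load entries before the service starts accepting requests."""
        for key in keys:
            self.entries[key] = (self.fetch(key), self.clock())

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]  # fresh hit
        try:
            value = self.fetch(key)
        except Exception:
            if entry is not None:
                return entry[0]  # backend unavailable: serve the stale entry
            raise  # nothing cached, so the failure must surface
        self.entries[key] = (value, self.clock())
        return value
```

Whether stale data is acceptable depends on the configuration being cached, so the fallback is a choice to make per use case.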
  37. Pattern 8: Throttles
       • Throttles and rate limits are often needed to moderate problem requestors and to dampen fluctuating systems
       • Example: Amazon Elastic Load Balancer and Amazon Elastic Compute Cloud (Amazon EC2)
       • It takes careful work to ensure that throttling does not impact the end customer experience
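
The slide does not say which throttling algorithm is used. A token bucket is one common choice, sketched here with an injectable clock so it can be tested. The class name and parameters are assumptions.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: steady refill rate with a bounded burst."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate_per_second, burst, clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self):
        """Return True and spend a token if the request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above the burst capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

The burst capacity is the knob that lets short, legitimate spikes through while still capping sustained load.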
  38. Pattern 9: Deltas
       • What happens when we do have too much configuration state to push around?
       • More efficient to compute deltas and distribute patches
       • But how do we actually do that?
  39. Pattern 9: Deltas
       Key  Value
       foo  bar
  40. Pattern 9: Deltas
       Key  Value  Version
       foo  bar    1
  41. Pattern 9: Deltas
       Key  Value  Version
       foo  bar    1
       foo  baz    2
  42. Pattern 9: Deltas
       Key  Value  Version
       foo  bar    1
       foo  baz    2
       foo  bar    3
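
The transcript shows only the versioned table, not the delta computation itself. One possible reading, sketched below, is an append-only log where every write gets the next version number and a client asks for everything newer than the version it last saw. `VersionedStore` and `delta_since` are hypothetical names.

```python
class VersionedStore:
    """Append-only key/value log with a monotonically increasing version."""

    def __init__(self):
        self.rows = []  # (key, value, version), in write order
        self.version = 0

    def put(self, key, value):
        """Record a write and return the version assigned to it."""
        self.version += 1
        self.rows.append((key, value, self.version))
        return self.version

    def delta_since(self, seen_version):
        """Return (patch, current_version) covering writes after seen_version."""
        patch = {}
        for key, value, version in self.rows:
            if version > seen_version:
                patch[key] = value  # later rows overwrite earlier ones
        return patch, self.version
```

Replaying the slide's table, a client that last saw version 2 receives only the row written at version 3.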
  43. Pattern 10: Modality and Constant-Work
       • So far, we can build a loosely coupled control plane, with deltas to minimize work, and throttles to keep things safe
       • But what if a LOT of things change at the same time?
       • We don’t want to build up backlogs and queues and introduce lag
  44. Pattern 10: Modality and Constant-Work
       • Systems that change performance in response to workload or data patterns can be fragile
       • Example: relational databases are great for flexible business queries, but terrible for stable control planes. Hidden optimizations and query plan flips can wreak havoc
       • Deployments, peak events, and power events all incur risk because they can be new modes
  45. Pattern 10: Modality and Constant-Work
       • How dumb would it be to make a really, really simple control plane?
       • User calls an API that edits a configuration file on Amazon Simple Storage Service (Amazon S3)
       • Push that configuration file every 10 seconds … whether it changed or not!
       • Very, very reliable and robust
  46. Pattern 10: Modality and Constant-Work
       • Our network health checks, including Amazon Route 53 health checks, are a good example
       • Health checks are happening all of the time
       • Results are being published to consumers, all of the time
       • Zone or Region failure = no difference!
  47. Pattern 10: Modality and Constant-Work
       • 100 nodes requesting a configuration every second
       • $1200 / year in request costs
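
The slide gives only the total, so the arithmetic below is a reconstruction. It assumes a request price of $0.0004 per 1,000 GET requests, which is an assumption and not a figure from the slide.

```python
NODES = 100
REQUESTS_PER_NODE_PER_SECOND = 1
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
PRICE_PER_1000_REQUESTS_USD = 0.0004  # assumed GET price; not stated on the slide

requests_per_year = NODES * REQUESTS_PER_NODE_PER_SECOND * SECONDS_PER_YEAR
annual_cost_usd = requests_per_year / 1000 * PRICE_PER_1000_REQUESTS_USD

print(f"{requests_per_year:,} requests/year, about ${annual_cost_usd:,.0f}")
```

Under that assumed price, the result is roughly $1,261 per year, consistent with the slide's rounded figure of $1200.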
  48. What did we learn about building stable systems?
       • Closing loops is critical: measure the progress!
       • Loose asynchronous coupling helps
       • Think about the modalities of the system
       • Our lessons are baked into Amazon API Gateway and AWS Lambda
  49. Thank you!