Closing Loops and Opening Minds: How to Take Control of Systems, Big and Small (ARC337) - AWS re:Invent 2018

© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Colm MacCárthaigh
Senior Principal Engineer
AWS
ARC337

“Quality is not an act, it is a habit”
Aristotle, some time around 350BC

Amazon CloudFront Control Plane


What goes into high quality designs
Diverse creative minds working in a fearless environment
Systematic reviews and mechanisms to share lessons
Use well-worn patterns where possible and focus
invention where it is truly needed
Testing, testing, testing, testing, testing

How we make trade-offs in design

Control Planes vs. Data Planes
Control Planes are often a bigger design
challenge than the data planes that they
support.
Poorly designed Control Planes have the
ability to cause large outages, or worse:
misconfigurations and corruption.

What do Control Planes do in the Cloud?
Manage the life cycle for resources
Provision software
Provision service configuration
Provision user configuration

Control Theory 101
• Independently discovered in several
fields of engineering and science
• Formalized in the early-to-mid
twentieth century
• One of the most under-appreciated
branches of science, incredibly relevant
to distributed systems

Control Theory 101
PID (proportional-integral-derivative) controllers
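
As an illustration of the PID idea, here is a minimal sketch of a textbook discrete PID controller. It is not taken from the talk: the class name, the gains, and the toy "fleet size" being controlled are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller; the gains are illustrative assumptions."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to accumulated past error
    kd: float  # derivative gain: reacts to how fast the error is changing
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the correction to apply for one control interval."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 200) -> float:
    """Drive a hypothetical fleet size toward a desired size of 100 hosts."""
    pid = PID(kp=0.5, ki=0.1, kd=0.0)
    hosts = 0.0
    for _ in range(steps):
        # Closed loop: measure the current state, then apply the correction.
        hosts += pid.step(setpoint=100.0, measured=hosts, dt=1.0)
    return hosts
```

With these assumed gains the measured value settles on the setpoint; the point of the sketch is only that the controller keeps measuring and correcting rather than issuing a change once and hoping.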

Pattern 1: Checksum all of the things
watch:
out:
for:
- YAML
this:
file:
can:
be:
-truncated
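
A truncated file like the one above can still parse, so a parser alone will not catch the damage. One possible way to catch it is to ship a digest with the payload and verify it before use. The sketch below is an assumption about how that might look, not the mechanism described in the talk; the `# sha256=` trailer format and the function names are hypothetical.

```python
import hashlib

TRAILER = b"\n# sha256="  # hypothetical trailer marker, chosen for this sketch


def seal(payload: bytes) -> bytes:
    """Append a SHA-256 hex digest of the payload as a trailer line."""
    digest = hashlib.sha256(payload).hexdigest().encode("ascii")
    return payload + TRAILER + digest + b"\n"


def unseal(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails verification."""
    body, sep, trailer = blob.rpartition(TRAILER)
    if not sep:
        raise ValueError("missing checksum trailer (file may be truncated)")
    expected = trailer.strip()
    actual = hashlib.sha256(body).hexdigest().encode("ascii")
    if actual != expected:
        raise ValueError("checksum mismatch")
    return body
```

A consumer would call `unseal` before parsing and refuse to apply anything that raises, keeping its last known-good configuration instead.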

Pattern 2: Cryptographic Authentication
Encrypt and authenticate everything! Control Planes
are powerful and security critical systems
Be able to revoke and rotate every credential, but also
watch out for certificate expiry
Prevent human access to production credentials
Never allow a non-production control plane to talk to
the production data plane

Pattern 3: Cells, Shells, and Poison Tasters
We divide up our control planes horizontally into
regions, availability zones and cells
It’s also common to compartmentalize control
planes so that the data plane is insulated from
control plane crashes
Poison tasters: check up front that a change is
safe

Pattern 4: Asynchronous Coupling
Synchronous systems are very strongly coupled
A problem in a synchronous downstream
dependency has immediate impact on the
upstream callers
Retries from upstream callers can all-too-easily
fan-out and amplify problems
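
To make the amplification concrete, here is a small worked calculation. It assumes a simple chain of services in which every layer makes the same number of attempts against the layer below it; the function name and the numbers are illustrative, not from the talk.

```python
def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Worst-case calls reaching the deepest dependency for one user request.

    Each layer independently tries up to `attempts_per_layer` times, so the
    attempts multiply at every hop down the call chain.
    """
    return attempts_per_layer ** layers


# One original try plus two retries at each of three layers.
amplification = worst_case_attempts(attempts_per_layer=3, layers=3)
```

Under these assumptions a single failing request can turn into 27 calls against the dependency that is already struggling.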

Pattern 4: Asynchronous Coupling
Asynchronously coupled systems tend to be more
tolerant of failures
Can make partial progress even when some
components are unavailable
Workflows and queues can be tuned to have
deterministic retry behaviors

Pattern 5: Closed Feedback Loops

Pattern 6: Small pushes and large pulls
Very Frequently Asked Question: Is it better to
push, or to pull?
For example: should data plane hosts accept
connections and be pushed configurations, or
should they connect to the control plane and pull
them?
It’s really the wrong question!

Pattern 6: Small pushes and large pulls
Long lived connections can support pushing
timely updates regardless of the “direction” of
the connection
Better to ask: which fleet is bigger? In general,
small fleets should connect to bigger fleets.
This avoids the problems of small fleets being
overwhelmed with thundering herds and retry
storms

Pattern 7: Avoiding Cold Starts and Cold Caches
Caches are bi-modal systems. Super fast when
they have entries, and slow when they are empty
A thundering herd hitting a cold cache can
prevent it from ever getting warm
Retry storms often need to be moderated by
throttles

Pattern 7: Avoiding Cold Starts and Cold Caches
Work out if you really need a cache at all
Pre-warm caches before accepting requests
Consider serving stale entries when backends are
unavailable
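
The last two suggestions can be sketched together. This is a minimal illustration, assuming a hypothetical `fetch` callable that stands in for the backend; the class name, the TTL handling, and the injectable clock are choices made for the example.

```python
import time
from typing import Callable, Dict, Iterable, Tuple


class StaleOnErrorCache:
    """Cache that can be pre-warmed and that serves stale entries on failure."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch  # hypothetical backend lookup
        self._ttl = ttl
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}

    def warm(self, keys: Iterable[str]) -> None:
        """Populate the cache before the host starts accepting requests."""
        for key in keys:
            self.get(key)

    def get(self, key: str) -> str:
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh hit
        try:
            value = self._fetch(key)
        except Exception:
            if cached is not None:
                return cached[0]  # backend unavailable: serve the stale entry
            raise  # nothing cached, so the failure has to surface
        self._entries[key] = (value, now)
        return value
```

Whether stale data is acceptable depends on what is being cached, which is why the slide says "consider".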

Pattern 8: Throttles
Throttles and rate-limits are often needed to
moderate problem requestors and to dampen
fluctuating systems
Example: Amazon Elastic Load Balancer and
Amazon Elastic Compute Cloud (Amazon EC2)
It takes careful work to ensure that throttling does
not impact the end customer experience
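
The slides do not say which throttling algorithm is used. A token bucket is one common choice, sketched here under that assumption; the class name and parameters are illustrative.

```python
class TokenBucket:
    """Token-bucket rate limiter: steady refill rate with a bounded burst."""

    def __init__(self, rate: float, burst: float, now: float = 0.0) -> None:
        self.rate = rate      # tokens added per second
        self.burst = burst    # maximum tokens the bucket can hold
        self.tokens = burst   # start full so an idle caller can burst
        self.last = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Return True if a request costing `cost` tokens may proceed at `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Passing the time in explicitly keeps the sketch deterministic; a real caller would supply something like `time.monotonic()`.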

Pattern 9: Deltas
What happens when we do have too much
configuration state to push around?
More efficient to compute deltas and distribute
patches
But how do we actually do that?

Pattern 9: Deltas
Key Value
foo bar

Pattern 9: Deltas
Key Value Version
foo bar 1

Pattern 9: Deltas
Key Value Version
foo bar 1
foo baz 2

Pattern 9: Deltas
Key Value Version
foo bar 1
foo baz 2
foo bar 3
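
One reading of the tables above is an append-only log in which every write gets a new version number, so a consumer can ask for everything that changed since the version it last applied. The sketch below follows that reading; the class and method names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (key, value, version)


class VersionedStore:
    """Append-only key/value log with a monotonically increasing version."""

    def __init__(self) -> None:
        self._log: List[Row] = []
        self._version = 0

    def put(self, key: str, value: str) -> int:
        """Record a write and return the version assigned to it."""
        self._version += 1
        self._log.append((key, value, self._version))
        return self._version

    def delta_since(self, version: int) -> Dict[str, str]:
        """Return the latest value of every key written after `version`."""
        patch: Dict[str, str] = {}
        for key, value, row_version in self._log:
            if row_version > version:
                patch[key] = value  # later rows overwrite earlier ones
        return patch


store = VersionedStore()
store.put("foo", "bar")  # version 1
store.put("foo", "baz")  # version 2
store.put("foo", "bar")  # version 3: same value as version 1, new version
```

A consumer that last applied version 2 still receives `foo = bar` as a patch even though the value matches version 1, because the delta is computed from versions rather than by comparing values.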

Pattern 10: Modality and Constant-Work
So far, we can build a loosely coupled control
plane, with deltas to minimize work, and throttles
to keep things safe
But what if a LOT of things change at the same
time?
We don’t want to build up backlogs and queues
and introduce lag

Systems that change performance in response to
workload or data patterns can be fragile
Example: Relational databases are great for
flexible business queries, but terrible for stable
control planes. Hidden optimizations and query
plan flips can wreak havoc
Deployments, peak events, power events, all incur
risk because they can be new modes

How dumb would it be to make a really really
simple control plane?
User calls an API that edits a configuration file on
Amazon Simple Storage Service (Amazon S3).
Push that configuration file every 10 seconds …
whether it changed or not!
Very very reliable and robust
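
A minimal sketch of that loop, assuming hypothetical `read_config` and `push` callables in place of the real storage and distribution steps. The thing to notice is what is missing: there is no "did it change?" branch, so a quiet period and a busy one do the same work.

```python
from typing import Callable


def constant_work_pusher(read_config: Callable[[], bytes],
                         push: Callable[[bytes], None],
                         ticks: int) -> None:
    """Push the full configuration once per tick, changed or not."""
    for _ in range(ticks):
        push(read_config())
        # A real loop would sleep here until the next interval boundary,
        # for example every 10 seconds as on the slide.
```

Because the amount of work per tick does not depend on how much changed, a burst of changes cannot build a backlog.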

Our network health checks, including Amazon
Route 53 health checks, are a good example
Health checks are happening all of the time
Results are published to consumers, all of the
time
Zone or Region failure = no difference!

100 nodes requesting a configuration every
second
$1200 / year in request costs
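
The slide does not show how the figure is derived. The arithmetic below reproduces a number in that range under one assumption: a request price of $0.0004 per 1,000 requests, which is used here only for illustration and is not stated in the deck.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

# Assumed price, not stated on the slide: USD per 1,000 requests.
ASSUMED_PRICE_PER_1000 = 0.0004


def yearly_request_cost(nodes: int, requests_per_second: float,
                        price_per_1000: float) -> float:
    """Yearly request cost in USD for a fleet polling at a constant rate."""
    requests_per_year = nodes * requests_per_second * SECONDS_PER_YEAR
    return requests_per_year / 1000 * price_per_1000


cost = yearly_request_cost(100, 1, ASSUMED_PRICE_PER_1000)
```

With that assumed price, 100 nodes at one request per second make about 3.15 billion requests a year and cost roughly $1,261, consistent with the slide's rounded $1200.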

What did we learn about building stable systems?
Closing loops is critical: measure the progress!
Loose asynchronous coupling helps
Think about the modalities of the system
Our lessons are baked into Amazon API Gateway
and AWS Lambda

Thank you!
