Leveraging Automation for a Disposable Infrastructure

May 16-17 2018
Mike Fowler, Senior Site Reliability Engineer

Senior Site Reliability Engineer in the Public Cloud
Practice
Background in Software & Systems Engineering,
System & Database Administration
Contributed to PostgreSQL, Terraform & YAWL
PostgreSQL evangelist
About Me

So I like to think I know Data...

The story, all names, characters, and incidents
portrayed in this production are fictitious. No
identification with actual persons (living or
deceased), places, buildings, and products is
intended or should be inferred.
Franchise coffee shops
Our hero, a lowly Head of Systems Engineering is
faced with the epic quest of moving to the cloud
Our Hero’s Epic Quest

Use cloud as spare/batch capacity
Duplicate existing estate in the cloud
Brave New World
- Greenfield development
- “Version 2.0”
Approaching Cloud Migration

Direct mapping of existing infrastructure to the cloud
- Load balancers become Elastic Load Balancers
- SANs become Buckets or Elastic File Systems
Minimal operational change required
- Everything is the same just in a new location
Perceived as a “quick win” to cloud adoption
- Little AWS/GCP/Azure specific knowledge required
The Appeal of a Lift & Shift

We’re changing only where our hardware is
- Operationally no different than the past
- Instance size based on current hardware size
- No change to deployment process
Under-utilisation of resources
- Still paying for excess capacity
Stunted scalability
- We can throw more virtual hardware at it
- Add additional node behind load balancers
The Penalty of a Lift & Shift

Our hero has a new CTO
Recognises that we’re just moving our problems
“We’re under-investing in the future”
Brave New World

No “legacy” baggage
Free rein for experimentation
Perceived as a “low risk” path to cloud adoption
- If it doesn’t work, switch it off
- “No risk” to existing production environment
The Appeal of a Brave New World

Organisationally isolated
- Limited impact to existing practices
- Leads to an “Us vs. Them” mentality
Focus is usually on application functionality with infrastructure seen as a necessity
Project has a high risk of failure
- Carefree scoping leads to an unfocused project
- Significant time can be lost to integrating with the old world
The Penalty of a Brave New World

Are we just building a traditional but virtual data centre?
- Lift & Shift is operationally the same
- Brave New World isn’t part of the Real World
How are we leveraging the power of a dynamic infrastructure?
Our infrastructure is scalable, but is the application?
Are we really “doing cloud”?

This is not a new problem
How do we move on from our
comfortable past?
Breaking the Mould

Conway’s law states you’re doomed to design your
organisational structure
● Conway’s Law:
“Organisations which design
systems … are constrained to
produce designs which are copies
of the communication structures of
these organisations”
- Melvin Conway, 1967
Breaking the Mould

Scaling of software isn’t just the same elements
bigger, it’s an increase in different elements that
interact in a nonlinear fashion. Complexity of the
whole increases much more than linearly.
● No Silver Bullet:
“A scaling-up of a software entity is
not merely a repetition of the same
elements in larger size; it is
necessarily an increase in the
number of different elements. In
most cases, the elements interact
with each other in some nonlinear
fashion, and the complexity of the
whole increases much more than
linearly.”
- Fred Brooks Jr., 1986
Breaking the Mould

Applying existing patterns at best misses out on possible improvements with new
technology and at worst it adds more complexity.
● Infrastructure as Code
“In many cases, applying existing
patterns will, at best, miss out on
opportunities to leverage newer
technology to simplify and
improve the architecture. At
worst, replicating existing
patterns with the newer platforms
will involve adding even more
complexity.”
- Kief Morris, 2016
Breaking the Mould

Systems should work correctly even in the face of adversity
● Designing Data-Intensive Applications:
“The system should continue to work
correctly (performing the correct
function at the desired level of
performance) even in the face of
adversity (hardware or software faults,
and even human error).”
- Martin Kleppmann, 2017
Breaking the Mould

Our hero needs a different approach
A Different Approach

The more you care about individual
things the more they will hold your
attention
In a truly scalable environment you
should only care about the combination of
many individual things
Attitude
The attitude you have to your
environment will determine the
limits of your scalability

You treat your servers like pets
- You give them names (igloo, husky, snowshoe)
- You give them homes (racks on site or co-located)
- If they fail, you do everything you can to save them
Every server is an investment
- Often the best hardware that can be afforded
- Amortised over years
- Excess capacity to allow for growth
Provisioning new servers takes weeks
Attitude: Living in the Iron Age

You treat your servers like cattle
- They have identifiers
- You care only where they are geographically
- If they fail, you put them down and get a new one
Your architecture is your investment
- Configuration is chosen for your current load
- Pay for what you use
- Capacity can be added when required
Provisioning new servers takes seconds
Attitude: Living in the Cloud Age

Are we simply herding our pets?
- In a Lift & Shift this is almost certainly so
- Scaling groups is a start but it is not the end
How are we managing our virtual servers?
- Complex cloud-init scripts?
- Traditional configuration management?
Attitude: Is Pets v Cattle enough?

Everything is a package and can be discarded
You treat your servers like single use products
- They’re pre-packaged for a particular purpose
- If they fail, you toss them away and grab another
You automate everything
Never make a manual change
Attitude: The Disposable Infrastructure

(slide 1 of 2)
Repeatability brings reliability and predictability
Defining a build pipeline:
- Ensures the same process is followed for every change
- Provides an audit trail for every change
- Gives visibility of your value stream
Be Continuous
Continuous integration and
delivery is a must

(slide 2 of 2)
Your developers probably already practice CI
- It is the standard for code development
- The output of CI can be the start of CD
Continuous delivery doesn’t have to mean continuous deployment
- Build pipelines can have approval stages
- Every change should be deployable
Be Continuous
Continuous integration and
delivery is a must

Many applications expect a static infrastructure
- Hard-coded assumptions that an IP address won’t change once an application is
started
Many applications are cluster unaware
- Sticky sessions on load balancers can help
- Some protocols don’t load balance well
Refactoring to the Cloud
Your applications need to be
(re)built to fit a dynamic
infrastructure
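
One way to drop the static-address assumption is to look the endpoint up every time a connection is opened instead of caching an IP at start-up. A minimal sketch, assuming the service is reachable by a DNS name; the function names are illustrative and not part of the original talk:

```python
import socket


def resolve_endpoint(host: str, port: int) -> tuple[str, int]:
    """Look up the current address for host; call this on every (re)connect."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); take the first.
    sockaddr = infos[0][4]
    return sockaddr[0], sockaddr[1]


def connect(host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a connection using a fresh lookup rather than a remembered IP."""
    return socket.create_connection(resolve_endpoint(host, port), timeout=timeout)
```

If the instance behind the name is replaced, the next call to `connect` picks up the new address without restarting the application.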

Refactor to contemporary architectural approaches
- Service Oriented Architectures & Microservices
- Transition from stateful services to stateless
Package everything using distribution packagers
- The output of your build pipeline is an RPM/DEB
- Your $CM_TOOL already supports this
Choose a deployment strategy
- Machine images vs. containers
Adopting Contemporary Approaches

Fear not vendor lock-in; savings are to be reaped by leveraging commodity services
Use SQS instead of automating the installation and configuration of a message
broker and accepting the operational burden of maintaining it
Careful abstraction of the API will allow porting to a different platform if absolutely
necessary
Fear not Vendor Lock-In
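
The "careful abstraction of the API" might look like the sketch below: the application depends on a two-method interface, and only one adapter class knows which vendor sits behind it. `OrderQueue`, `InMemoryOrderQueue` and `place_order` are hypothetical names; an SQS-backed class would implement the same two methods using the vendor SDK and is deliberately left out here.

```python
from collections import deque
from typing import Deque, Optional, Protocol


class OrderQueue(Protocol):
    """The only queue operations the application is allowed to depend on."""

    def send(self, body: str) -> None: ...

    def receive(self) -> Optional[str]: ...


class InMemoryOrderQueue:
    """Stand-in for tests and local runs; a vendor-backed class would
    expose the same two methods."""

    def __init__(self) -> None:
        self._messages: Deque[str] = deque()

    def send(self, body: str) -> None:
        self._messages.append(body)

    def receive(self) -> Optional[str]:
        return self._messages.popleft() if self._messages else None


def place_order(queue: OrderQueue, order_id: str) -> None:
    """Application code sees only the OrderQueue interface."""
    queue.send(f"order:{order_id}")
```

Porting to another platform then means writing one new adapter rather than touching every call site.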

(slide 1/2)
Design the infrastructure in parallel to the cloud aware application changes
Mandate every instance is part of a scaling group to enforce cluster awareness
Use the same principles for infrastructure development as you use for applications
Infrastructure is Code
Dynamic infrastructure must
be treated as a first class
citizen in any cloud project

(slide 2/2)
Script/encode everything unless there is no API/tooling support
Deploy the same infrastructure in development, test and production environments
- Sizing can be parameterised
Your deployment pipeline becomes the assembly of application packages and
infrastructure configuration
High cohesion and loose coupling applies to infrastructure as much as it does to
applications
Infrastructure is Code
Dynamic infrastructure must
be treated as a first class
citizen in any cloud project
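
To illustrate "same infrastructure, parameterised sizing": the stack definition below has an identical shape in every environment and only the sizing values differ. This is a sketch in plain Python rather than any particular tool's syntax; the instance types, counts and component names are assumptions chosen for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Sizing:
    instance_type: str
    min_instances: int
    max_instances: int


# Hypothetical sizes; these are the only values that vary by environment.
SIZING = {
    "development": Sizing("t3.small", 1, 1),
    "test": Sizing("t3.medium", 2, 2),
    "production": Sizing("m5.large", 3, 12),
}


def describe_stack(environment: str) -> dict:
    """Return the same stack shape for every environment, sized by parameter."""
    size = SIZING[environment]
    return {
        "environment": environment,
        "web": {"scaling_group": True, **asdict(size)},
        "worker": {"scaling_group": True, **asdict(size)},
        "queue": {"managed": True},
    }
```

Because development and production share one definition, a change tested in development exercises the same structure that will later run in production.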

If it can go wrong, it will go wrong so
think in terms of when and not if
Treating our infrastructure and its hosted
applications as disposable in conjunction
with CD eliminates a number of failure
scenarios
Planning to fail
Planning to fail will lead to
success

(slide 1/3)
Regularly test your disposability
- Terminate instances at random to ensure resiliency
- Block all network access to an instance
- Chaos Monkey & the Simian Army
- Trigger failovers for less disposable services
Constantly churning disposable instances helps prevent configuration drift
Planning to fail
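
Terminating instances at random can start as something very small. The sketch below is not Chaos Monkey; it only shows the core idea, with the cloud call injected as a `terminate` function so the selection logic can be tested without touching real instances. All names are hypothetical.

```python
import random
from typing import Callable, Optional, Sequence


def terminate_random_instance(
    instance_ids: Sequence[str],
    terminate: Callable[[str], None],
    rng: Optional[random.Random] = None,
) -> Optional[str]:
    """Pick one instance at random, terminate it and return its id.

    Returns None when the group is empty.
    """
    if not instance_ids:
        return None
    rng = rng or random.Random()
    victim = rng.choice(list(instance_ids))
    terminate(victim)
    return victim
```

In a real environment `terminate` would call the provider's API, and the scaling group would be expected to replace the lost instance.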

(slide 2/3)
Availability and durability cost
Identify points of failure and assess:
- How often will this failure occur?
- How do I mitigate this failure?
- How do I test this failure to ensure mitigation?
- Is the cost of mitigation worth the customer impact during failure?
Planning to fail
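
The last question in the list is simple arithmetic once the first ones have answers. A rough sketch, assuming the mitigation fully prevents the failure and that frequency and impact can be estimated at all; the function names and any figures used with them are illustrative, not from the talk:

```python
def expected_annual_loss(failures_per_year: float, cost_per_failure: float) -> float:
    """Expected yearly cost of leaving a failure unmitigated."""
    return failures_per_year * cost_per_failure


def mitigation_is_worthwhile(
    failures_per_year: float,
    cost_per_failure: float,
    mitigation_cost_per_year: float,
) -> bool:
    """True when mitigating costs less than the failures it prevents."""
    return mitigation_cost_per_year < expected_annual_loss(
        failures_per_year, cost_per_failure
    )
```

For example, a failure expected once every four years (0.25 per year) that costs 20,000 per occurrence has an expected annual loss of 5,000, so a mitigation costing 60,000 a year would not pay for itself on those numbers.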

(slide 3/3)
Be honest in assessing the worth of your business
- Do you really need to double your costs to run in multiple regions?
- Trello, Slack & many other high-profile companies, including Amazon, were
affected by the S3 outage
Planning to fail

Test the durability of your data
- User error is your biggest risk
- - “I forgot the WHERE clause”
- - “I thought I was in the test environment”
Regularly exercise data loss & recovery scenarios in development and test
environments
Make back-ups and regularly test they restore
- Consider storing backups in both S3 & Google
- Store backups in multiple regions
If you don’t want a full ELK stack at least ship log files to CloudWatch or Stackdriver
Data is not Disposable
Data is not disposable and is
probably more important
than your availability
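
"Regularly test they restore" can be automated like any other check. The sketch below uses SQLite from the Python standard library purely as a stand-in so that it runs anywhere; a PostgreSQL setup would use its own backup tooling, and the row-count comparison is only the simplest possible verification. Function names are hypothetical.

```python
import sqlite3


def backup_database(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the source database into a separate in-memory database."""
    target = sqlite3.connect(":memory:")
    source.backup(target)
    return target


def restore_matches(source: sqlite3.Connection, restored: sqlite3.Connection) -> bool:
    """Compare the row count of every source table with the restored copy."""
    tables = [
        row[0]
        for row in source.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    for table in tables:
        query = f'SELECT COUNT(*) FROM "{table}"'
        if source.execute(query).fetchone() != restored.execute(query).fetchone():
            return False
    return True
```

Running a check of this kind on a schedule in development and test environments turns "we have backups" into "we have restores".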

Multiple backup strategies, all failed
Multiple failures, same engineers, too much pressure, too tired, mistakes made
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
A Lesson to Learn From

Jenkins solves all our problems!
AWS solves all our problems!
Docker solves all our problems!
Kubernetes solves all our problems!
Tooling is Not The Answer
Tooling is not the answer
but it is part of an
automated solution

Let us assume we have a front end web application which places orders in a queue for
subsequent asynchronous fulfilment by a separate application backed by a database.
We’ve already refactored our applications for the cloud.
We will have a CI pipeline for the applications, the output being AMIs.
A separate CD pipeline executes infrastructure code and rolls out the new AMIs.
The goal is to promote infrastructure and AMIs between environments.
Remember Our Hero?

Can create many different machine images
Consider creating a base image to control OS updates
Use normal configuration management tools
- Support for Ansible, Chef & Puppet
- Can just write shell script if you must
Use placeholders for configuration to be filled by launch scripts
https://packer.io
Packer
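
The placeholder idea can be sketched independently of Packer itself: the image carries a configuration template, and a launch script fills in the environment-specific values at boot. The example below uses Python's `string.Template`; the setting names and values are assumptions for illustration, and how the values reach the instance (user data, a parameter store, and so on) is left open.

```python
from string import Template

# Baked into the image at build time; the $-prefixed names are placeholders.
BAKED_CONFIG = Template(
    "queue_url = $queue_url\n"
    "database_host = $database_host\n"
    "environment = $environment\n"
)


def render_config(values: dict) -> str:
    """Fill every placeholder; raises KeyError if a value is missing."""
    return BAKED_CONFIG.substitute(values)
```

Failing loudly on a missing value is deliberate: an instance that cannot be fully configured should not start serving traffic.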

Source our code from a repo, build and test
Package our application as a DEB or RPM
Place our artifact into an S3 repository
Run Packer to generate a new AMI
Application Pipeline

Declarative language for the construction of infrastructure
Supports all major vendors
State can be stored in buckets to facilitate sharing
Separate out infrastructure layers
- Minimises blast radius of changes
- Keep persistent apart from disposable
https://terraform.io
Terraform

Triggered by new AMIs or Terraform code changes
Apply Terraform to update the infrastructure
Run integration tests to verify application build
Wait for approval before promotion to next environment
Infrastructure Pipeline
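
The shape of such a pipeline, with an approval gate before promotion, can be sketched in a few lines. This is a toy model rather than the configuration of any CI/CD product: each stage is a function returning success or failure, and `approved` stands in for the manual approval step. All names are hypothetical.

```python
from typing import Callable, List, Tuple

Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage], approved: Callable[[], bool]) -> List[str]:
    """Run stages in order; stop at the first failure or a missing approval.

    Returns the names of the steps that completed.
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            return completed
        completed.append(name)
    # Promotion happens only after every stage passed and approval was given.
    if approved():
        completed.append("promote")
    return completed
```

A failed integration test or a withheld approval leaves the next environment untouched, which is what makes every change deployable without every change being deployed.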

Any instance can be terminated
Resilient to zone failure
Cross-region read replica allows DR for region failure
- Just need to run Terraform in the region to add the instances when required and
update Route 53
Deployed Infrastructure

● Have attitude
● Be continuous
● Refactor to the Cloud
● Infrastructure is code
● Plan to fail
● Data is King
● Tooling is not The Answer
Summary

Questions?
Mike Fowler
gh-mlfowler
mlfowler
mike dot fowler at claranet dot uk
