Infrastructure as Code to Maintain your Sanity

Staying sane with IaC
Dewey Sasser
Principal Consultant
Aligned Software

Contents

Where I'm coming from

IaC

DevOps

Sanity

How To...

About Dewey

Distributed Application Developer for 25 years
− Doing build/release/software process for
about that long
− Accidentally doing DevOps out of self-defense

Wandered into operations about 7 years ago
− Built some private cloud for dev
− Built some private cloud for prod
− Moved to public cloud architecture

Deployment Context

Largest Deployment:
− ~64 application servers
− ~96 MongoDB nodes
− Many postgres, S3, ...
− ~14,000 TPS

Smallest Deployment:
− 1 application server
− 1 DB
− 1-2 TPS

Where did all this come from?

I'm a developer

In doing IaC, I noticed some things really didn't
work well
− Some “developer” assumptions about Ops
didn't work
− Ops was missing some lessons developers
have learned

IaC patterns are a work in progress
− There are definitely some loose ends -- this is
*NOT* the product of 20 years of industry
consensus

Why IaC

Manage vastly increased complexity

Reduce risk

Understand changes before they hit production
− create a low risk location to try changes
− helps developers, too!

DR anyone?

Why do you care so much about Sanity?

UPTIME!

Sleep!

Because you're in this for the long run

Dev vs. Ops

Dev
− “I can make this faster!”
− “I can make this better!”
− “This will only take a minute...”
Enthusiasm!

Dev vs. Ops

Ops
− “Don't break anything!”
− “Don't lose data!”
− Alan Shepard's Prayer
Pessimism

DevOps

We know this will work

We can repeat this

We develop to make ops easier

We operate to make coding easier
Confidence!

The Development Cycle

Conceive

Plan

Build
− Develop
− Test

Verify
− Acceptance Test

Deploy

Manage

IaC code is different than program code

IaC is like program code...

Controlled changes

Build outside of user view

You can have “good code” and “bad code”

There are patterns and anti-patterns

IaC is unlike program code...

Programs describe HOW. IaC describes WHAT

Behavior is something that happens within the
infrastructure

Infrastructure is more difficult to test than
programs
− Nearly every test is an “integration test”

It's hard to isolate the pieces

OK, let's get to that Sanity Part...

What's important in IaC

IaC is not about speeding up your departure,
it's about speeding up your arrival

You need to map your technical concepts to
your thought processes

Consider your end goal

You need a running system

But you're going to be changing and deploying
new running systems

And you need to keep it running
− Maintenance Window
− Continuous Up-time

Be Declarative!

It's very difficult to reason about the state of an
infrastructure that is managed by procedures
− Procedural thinking makes “is” a 2nd-class concept

Incidentally, developers seem to love
procedures and hate state declarations

Procedures

Are ultimately necessary

Should be as far down in the process as possible
− CloudFormation and Terraform do this

Should be idempotent
− ensureWebServer(), not createWebServer()
− Run again and again and again

Bad!
db = createDatabase()
web = createWebServer(db)
addDNS("www...", web)

Because if you run it again, you get a 2nd DB, web server, ...

Better
db = ensureDBExists()
web = ensureWebExists(db)
ensureDNSName("www...", web)
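
The difference between "create" and "ensure" can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory `inventory` dict in place of a real cloud API; the function names only mirror the slide and call no real provider.

```python
# Hypothetical in-memory inventory standing in for a real cloud API.
inventory: dict[str, dict] = {}


def create_web_server(name: str) -> dict:
    """Not idempotent: every call adds another server."""
    server_id = f"{name}-{len(inventory) + 1}"
    inventory[server_id] = {"name": name}
    return inventory[server_id]


def ensure_web_server(name: str) -> dict:
    """Idempotent: returns the existing server if one already matches."""
    for server in inventory.values():
        if server["name"] == name:
            return server
    return create_web_server(name)


if __name__ == "__main__":
    ensure_web_server("web")
    ensure_web_server("web")
    print(len(inventory))  # 1
```

Running the "ensure" version any number of times leaves one server; each extra "create" call adds another.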

Best
# pseudo-code
resource "DB" {…}
resource "Web" {…; database=db}
resource "DNS" { name="www..."; IP=web.ip }

Why? Because the tool can understand what
you want and figure out how to get it
− CloudFormation, Terraform, Puppet
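
Why a declaration lets the tool "figure out how to get it" can be shown with a toy planner. This is a sketch only, not how CloudFormation, Terraform, or Puppet are implemented; the `plan` function and the resource dictionaries are assumptions made for illustration.

```python
def plan(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare declared state with observed state and list the needed actions."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))
    return actions


if __name__ == "__main__":
    desired = {"DB": {"size": "small"}, "Web": {"database": "DB"}}
    actual = {"DB": {"size": "small"}, "Old": {}}
    print(plan(desired, actual))  # [('create', 'Web'), ('delete', 'Old')]
```

When the actual state already matches the declaration, the plan is empty, which is the declarative form of idempotence.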

Idempotence
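Not idempotent:
startServer() {
    nginx -g "pid /var/run/nginx.pid;"
}

Idempotent:
startServer() {
    # Start nginx only if the PID file does not name a live process
    pid=$(cat /var/run/nginx.pid 2>/dev/null)
    if [ -z "$pid" ] || ! kill -0 "$pid" 2>/dev/null; then
        nginx -g "pid /var/run/nginx.pid;"
    fi
}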

Make it Modular

Program code has objects and functions

IaC should be modular, too
− Re-use
− Easy modification
− Consistent definitions

Make your modules semantically meaningful,
not functionally meaningful
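
One possible reading of "semantically meaningful" is a module named for what it is in your system (the web farm) rather than for the mechanics it performs (create instances, attach to balancer). The sketch below assumes that reading; `web_farm` and the resource dictionaries are hypothetical.

```python
def web_farm(environment: str, servers: int) -> list[dict]:
    """A module named for what it is in the system: the web farm."""
    balancer = {"type": "load_balancer", "name": f"{environment}-web-lb"}
    instances = [
        {"type": "server", "name": f"{environment}-web-{i}", "behind": balancer["name"]}
        for i in range(1, servers + 1)
    ]
    return [balancer, *instances]


if __name__ == "__main__":
    for resource in web_farm("dev", 2):
        print(resource["type"], resource["name"])
```

The same module then gives dev and prod a consistent definition that differs only in its parameters.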

Test and Verify

You wouldn't deliver an untested program

Don't deliver an untested infrastructure
But how?

Testing

You have a monitoring system, right?

Guess what...
− Testing “is” is just the first part of monitoring
− Testing “does” is the 2nd part

This gives you...Test Driven Development!
− You know when you're complete
− You know if something breaks

This means your monitoring system is not an
afterthought, it's a forethought!
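
The two kinds of check can be sketched with the standard library alone. The function names are assumptions for illustration, not part of any monitoring product: an "is" check asks whether something is listening, a "does" check asks whether it actually answers.

```python
import socket
import urllib.request


def is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """'Is' check: something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def does_respond(url: str, timeout: float = 2.0) -> bool:
    """'Does' check: the service answers an HTTP request with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        return False
```

Written before the infrastructure exists, checks like these fail first and pass once the roll-out is complete; left running afterwards, they become the monitoring.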

“Full Stack” thinking is hazardous

Full Stack is difficult to change post deployment

It's difficult to deal with cross-cutting concerns
across multiple stacks

Different parts of the stack are often maintained
by different people
− In large companies, sometimes by different
groups
− In any group over a single developer,
someone will always know more about part of
the system

Consider instead: Planes

Instead of a “stack”, think of “planes”:
− Physical
− Network
− Persistence
− Service
− Application
− Control

Planes

Similar to the OSI network model

Allows clear separation of responsibilities and
concerns

Planes support other planes

Your individual design may move some functionality
to a different plane. The key is to handle it
consistently
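
"Planes support other planes" implies an order in which they can be built. The sketch below derives that order from a set of assumed "supported by" relationships; the relationships are illustrative guesses, not taken from the talk, and the Control plane is left out because it acts on all the others.

```python
from graphlib import TopologicalSorter

# Assumed "supported by" relationships; adjust to your own design.
SUPPORTED_BY = {
    "Physical": set(),
    "Network": {"Physical"},
    "Persistence": {"Network"},
    "Service": {"Network", "Persistence"},
    "Application": {"Service", "Persistence"},
}


def build_order(supported_by: dict[str, set[str]]) -> list[str]:
    """Return planes ordered so each comes after the planes that support it."""
    return list(TopologicalSorter(supported_by).static_order())


if __name__ == "__main__":
    print(build_order(SUPPORTED_BY))
```

If your design moves databases from the Service plane to the Application plane, only the mapping changes; the rule for deriving the order stays the same.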

Physical Plane

Provided by AWS (or other cloud provider)

Or a rack of systems

Network Plane

Enables and controls communication

Location for computation and storage

Maybe handles communication security

Persistence Plane

This is a special plane – the only one you can't
burn down

Holds your critical data

If the company will fold if this data is lost, it's in
the Persistence plane.

Backups and non-recreatable data

Services Plane

Commonly available services
− DNS, VPN
− Service Location (Consul? etcd? DNS?)
− Secrets Management
− User Authentication/Management (maybe)

...but what about Databases?

DBs might be part of the “Service” plane

You might consider DBs part of the application
instead

You probably don't want to see them as
“Persistence” unless you're betting your
company that they'll never go down

Application Plane

This is the part that touches the users!
− Application Servers
− Micro-services
− Web farm
− DBs (maybe)

Modifying Application plane often modifies state
in the other planes
− e.g. DNS, consul, ...

Control Plane

Global Procedures
− Backups
− Batch Jobs
− State changes
− Infrastructure Roll-out
− Recovery

Often your cloud system provides part of this
implicitly

“Scale” is often set here – ability to scale is not

Must

Have a separate, isolated, no-risk infrastructure
development area

Have a way to verify functionality

Use a source control tool

Must Not

Develop in Production!
− Yeah, yeah, everyone knows this
− but “Can you just make this one little change...”

Note: If your process is good, you can make “one little
change” in the process, not work around the process.

Should

Have an automated (not necessarily automatic)
IaC roll-out process

Frequently “set up from scratch”

Should Not

Develop “near” production
− You don't want accidents to impact production –
build safeties!

Have manual steps
− A little bit of friction has impact FAR out of
proportion to its size

Did you forget it?

Did you miss a step?

Write the way you think

We think about infrastructure as objects
− “There is a network”
− “Here is a server”
− “The load balancer has these instances”

We don't think about infrastructure as
procedures

So...our tools should work the same way

Dealing with Component Dependencies

“Just make sure the DB is up before the app.”

“Oh, and auth needs to be up before the web
farm”

“Right, logging needs to be up before
everything..."

This way lies madness (and fragility!)
− Failures cascade
− Recovery doesn't

Cascading Recovery

Health Checks

Retry and Wait
− If it's not up, keep trying

This gives you...
− Scalability!
− Recovery time!
− Self Healing!

Why? Because very few people are at their
best at 3am

What if we don't have this already?

Wrap the component and make it

Don't be afraid of building better pieces out of
what developers give you

We want to make recovery cascade

Incidentally, this makes deployment a lot
easier...and faster

Service Discovery

Keep Service Discovery as local as possible
− Use cluster-local discovery instead of e.g.
global DNS
− Consul, etcd, private DNS zones

Keep service names as unchanged as possible
− Set the context of the service, don't configure
the pieces

Bind external services as late as possible
− Map the “www...” name in at the end

Writing the Code

Use inspection instead of variable passing
between planes

Use variable passing instead of hard-coding

Comment your code!
− The code shows “how”, you need to comment
on “why”

Use good commit messages!
− Make it easy to find when and where changes
are made
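
One way to read "inspection instead of variable passing" is that a plane discovers what another plane built by querying for it, rather than being handed IDs as parameters. The sketch below assumes that reading; `INVENTORY`, its tags, and `find_one` are hypothetical.

```python
# Hypothetical inventory as a provider API might report it.
INVENTORY = [
    {"id": "net-1", "kind": "network", "tags": {"plane": "network", "env": "dev"}},
    {"id": "net-2", "kind": "network", "tags": {"plane": "network", "env": "prod"}},
]


def find_one(kind: str, **tags: str) -> dict:
    """Discover a resource by kind and tags instead of being handed its ID."""
    matches = [
        r for r in INVENTORY
        if r["kind"] == kind and all(r["tags"].get(k) == v for k, v in tags.items())
    ]
    if len(matches) != 1:
        raise LookupError(f"expected exactly one {kind} with {tags}, found {len(matches)}")
    return matches[0]


if __name__ == "__main__":
    print(find_one("network", env="dev")["id"])
```

Failing loudly on zero or multiple matches keeps a mislabelled resource from being picked up silently.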

Rolling Out

Roll out changes to lower-risk environments first
− Dev Environment
− QA Environment
− Staging
− Prod

Wow, that's a lot of environments to manage

But you have IaC, so it's easy!
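
The promotion order above can be sketched as a loop that stops at the first environment whose verification fails. `deploy` and `verify` are hypothetical callables supplied by your own tooling.

```python
from typing import Callable

ENVIRONMENTS = ["dev", "qa", "staging", "prod"]


def roll_out(deploy: Callable[[str], None], verify: Callable[[str], bool]) -> list[str]:
    """Deploy to each environment in order, stopping at the first failed check."""
    completed = []
    for environment in ENVIRONMENTS:
        deploy(environment)
        if not verify(environment):
            break
        completed.append(environment)
    return completed
```

Because the same definition is applied everywhere, a failure in staging stops the change before it reaches prod.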

More Roll-out

Have some Canaries if possible
− Put a little traffic on the new system
− Be ready to take traffic off
− Often not possible
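
Putting "a little traffic" on the new system can be sketched as hash-based routing. This is one possible technique, not something the talk prescribes; `route` and the request ID scheme are assumptions.

```python
import hashlib


def route(request_id: str, canary_percent: int) -> str:
    """Send a stable slice of requests to the canary, the rest to current."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return "canary" if bucket < canary_percent else "current"


if __name__ == "__main__":
    print(route("req-1", 10))
```

Hashing keeps a given request ID on the same side between calls, and setting `canary_percent` to 0 takes all traffic off the canary at once.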

Roll back vs Roll Forward

Either works, but make a decision and stick
with it

If you're rolling forward, make sure you can roll
forward to a previous (working) state

Roll forward is easier (and faster) for
development

The Control Plane

What you use to control the behavior of the
other planes.

Execute and control backups

Add/Remove Users, etc

Run any batch jobs you need (data purge?)

A note about scheduled jobs...

Cron is Evil
− Well, not actually evil, but...
− Hard to monitor
− Hard to view results
− Hard to modify
− Requires sysadmin knowledge to change

Much better to have a single location with a UI

Run-books

Sufficiently detailed documentation is
executable
− Anything you do regularly should be scripted
− Failure recovery should be as automated as
possible

because downtime is bad

and thinking under pressure is harder

So, what's left is troubleshooting and problems
you don't yet know or understand

...which are difficult to Run-book
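
An executable run-book can be sketched as a list of steps, each pairing a check with its fix. The `Step` structure and `run_book` function are hypothetical, meant only to show the shape.

```python
from typing import Callable, NamedTuple


class Step(NamedTuple):
    description: str
    is_ok: Callable[[], bool]
    fix: Callable[[], None]


def run_book(steps: list[Step]) -> list[str]:
    """Run each step's fix only when its check fails; report what was done."""
    performed = []
    for step in steps:
        if not step.is_ok():
            step.fix()
            performed.append(step.description)
    return performed
```

Because every fix is guarded by its check, running the book a second time does nothing, which makes it safe to run under pressure.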

Tying it up

Describe your IaC as declaratively as possible

Develop your infrastructure in a separate
location

Organize your IaC into planes of responsibility

Build your IaC out of modules

Deploy your changes across environments

Automate all of your normal operations

Results

Less stress

Less risk

More predictability
