This document provides guidance on maintaining sanity when using infrastructure as code (IaC). It discusses why IaC is used, compares developer and operations mindsets, and explains how DevOps helps. It emphasizes designing IaC declaratively using modules and planes of responsibility. Changes should be tested across environments before production. Automation and monitoring are key to reduce risk and ensure stability. Following these practices allows for more predictability, less stress, and improved uptime.
3. About Dewey
Distributed Application Developer for 25 years
− Doing build/release/software process for about that long
− Accidentally doing DevOps out of self-defense
Wandered into operations about 7 years ago
− Built some private cloud for dev
− Built some private cloud for prod
− Moved to public cloud architecture
4. Deployment Context
Largest Deployment:
− ~64 application servers
− ~96 MongoDB nodes
− Many Postgres, S3, ...
− ~14,000 TPS
Smallest Deployment:
− 1 application server
− 1 DB
− 1-2 TPS
5. Where did all this come from?
I'm a developer
In doing IaC, I noticed some things really didn't
work well
− Some “developer” assumptions about Ops
didn't work
− Ops was missing some lessons developers
have learned
IaC patterns are a work in progress
− There are definitely some loose ends -- this is
*NOT* the product of 20 years of industry
consensus
6. Why IaC
Manage vastly increased complexity
Reduce risk
Understand changes before they hit production
− create a low risk location to try changes
− helps developers, too!
DR anyone?
7. Why do you care so much about Sanity?
UPTIME!
Sleep!
Because you're in this for the long run
8. Dev vs. Ops
Dev
− “I can make this faster!”
− “I can make this better!”
− “This will only take a minute...”
Enthusiasm!
9. Dev vs. Ops
Ops
− “Don't break anything!”
− “Don't lose data!”
− Alan Shepard's Prayer
Pessimism
10. DevOps
We know this will work
We can repeat this
We develop to make ops easier
We operate to make coding easier
Confidence!
13. IaC is like program code...
Controlled changes
Build outside of user view
You can have “good code” and “bad code”
There are patterns and anti-patterns
14. IaC is unlike program code...
Programs describe HOW. IaC describes WHAT
Behavior is something that happens within the
infrastructure
Infrastructure is more difficult to test than
programs
− Nearly every test is an “integration test”
It's hard to isolate the pieces
16. What's important in IaC
IaC is not about speeding up your departure,
it's about speeding up your arrival
You need to map your technical concepts to
your thought processes
17. Consider your end goal
You need a running system
But you're going to be changing and deploying
new running systems
And you need to keep it running
Maintenance Window
Continuous Up-time
19. Be Declarative!
It's very difficult to reason about the state of an
infrastructure that is managed by procedures
− Procedural thinking makes “is” a 2nd-class concept
Incidentally, developers seem to love
procedures and hate state declarations
20. Procedures
Are ultimately necessary
Should be as far down in the process as possible
− CloudFormation and Terraform do this
Should be idempotent
− ensureWebServer(), not createWebServer()
− Run again and again and again
21. Bad!
db = createDatabase()
web = createWebServer(db)
addDNS("www...", web)
Because if you run it again, you get a 2nd DB, web server, ...
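A minimal sketch of the difference between "create" and "ensure", assuming a hypothetical in-memory inventory in place of a real cloud API. The names `inventory`, `create_web_server` and `ensure_web_server` are illustrative, not from any tool.

```python
# Hypothetical in-memory "cloud"; a real one would be an API call away.
inventory = {}


def create_web_server(name):
    """Procedural: blindly makes a new server every time it runs."""
    server_id = f"{name}-{len(inventory) + 1}"
    inventory[server_id] = {"role": name}
    return server_id


def ensure_web_server(name):
    """Idempotent: returns the existing server, creating one only if missing."""
    for server_id, attrs in inventory.items():
        if attrs["role"] == name:
            return server_id
    return create_web_server(name)


first = ensure_web_server("web")
second = ensure_web_server("web")  # run again: nothing new is built
```

Running the "ensure" version twice leaves one server in the inventory. Running the "create" version twice leaves two.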
23. Best
# pseudo-code
resource "DB" {…}
resource "Web" {…; database=DB}
resource "DNS" { name="www..."; IP=Web.ip }
Why? Because the tool can understand what
you want and figure out how to get it
− CloudFormation, Terraform, Puppet
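A rough sketch of what "figure out how to get it" can mean: compare what was declared with what exists and derive the steps. This is a toy illustration, not how CloudFormation, Terraform or Puppet are actually implemented, and the resource attributes are made up.

```python
def plan(desired, actual):
    """Compare declared resources with what exists and list the steps needed."""
    steps = []
    for name, spec in desired.items():
        if name not in actual:
            steps.append(("create", name))
        elif actual[name] != spec:
            steps.append(("update", name))
    for name in actual:
        if name not in desired:
            steps.append(("delete", name))
    return steps


# Hypothetical declarations and current state.
desired = {"DB": {"size": "large"}, "Web": {"database": "DB"}, "DNS": {"target": "Web"}}
actual = {"DB": {"size": "small"}, "Old": {}}

print(plan(desired, actual))
print(plan(desired, desired))  # already converged: nothing to do
```

The second call is the idempotency property again: once reality matches the declaration, the plan is empty.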
25. Make it Modular
Program code has objects and functions
IaC should be modular, too
− Re-use
− Easy modification
− Consistent definitions
Make your modules semantically meaningful,
not functionally meaningful
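One way to picture a semantically meaningful module, as a sketch only: the caller asks for "an app server" and the module decides which parts that implies. The function `app_server` and the resource types it emits are assumptions for illustration.

```python
def app_server(name, instance_count=2):
    """Semantic module: callers ask for "an app server", not for its parts."""
    return {
        f"{name}-instances": {"type": "vm", "count": instance_count},
        f"{name}-lb": {"type": "load_balancer", "targets": f"{name}-instances"},
        f"{name}-health": {"type": "health_check", "target": f"{name}-lb"},
    }


# Two uses of the same module give consistent definitions.
infrastructure = {}
infrastructure.update(app_server("storefront"))
infrastructure.update(app_server("admin", instance_count=1))
```

Changing what "an app server" means now happens in one place, and every use picks it up.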
26. Test and Verify
You wouldn't deliver an untested program
Don't deliver an untested infrastructure
But how?
27. Testing
You have a monitoring system, right?
Guess what...
− Testing “is” is just the first part of monitoring
− Testing “does” is the 2nd part
This gives you...Test Driven Development!
− You know when you're complete
− You know if something breaks
This means your monitoring system is not an
afterthought, it's a forethought!
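A small sketch of "is" checks and "does" checks sharing one runner, assuming a hypothetical `system` dictionary in place of real probes. In practice the probes would be whatever your monitoring system already runs.

```python
def run_checks(checks):
    """Run named checks; "is" checks first, "does" only once "is" holds."""
    results = {}
    for kind in ("is", "does"):
        for name, (check_kind, probe) in checks.items():
            if check_kind == kind:
                results[name] = bool(probe())
        if kind == "is" and not all(results.values()):
            break  # no point testing the behavior of something that isn't there
    return results


# Hypothetical observed state; real probes would query the infrastructure.
system = {"web": {"running": True, "homepage_status": 200}}

checks = {
    "web server exists": ("is", lambda: "web" in system),
    "web server is running": ("is", lambda: system.get("web", {}).get("running", False)),
    "homepage responds": ("does", lambda: system.get("web", {}).get("homepage_status") == 200),
}

print(run_checks(checks))
```

Written before the infrastructure exists, the same checks fail first and pass when you are done, which is the test-driven part.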
31. “Full Stack” thinking is hazardous
Full Stack is difficult to change post deployment
It's difficult to deal with cross-cutting concerns across multiple stacks
Different parts of the stack are often maintained
by different people
− In large companies, sometimes by different
groups
− In any group over a single developer,
someone will always know more about part of
the system
32. Consider instead: Planes
Instead of a “stack”, think of “planes”:
− Physical
− Network
− Persistence
− Service
− Application
− Control
33. Planes
Similar to the OSI network model
Allows clear separation of responsibilities and
concerns
Planes support other planes
Your individual design may move a functionality to a different plane. The key is to handle it consistently
35. Network Plane
Enables and controls communication
Location for computation and storage
Maybe handles communication security
36. Persistence Plane
This is a special plane – the only one you can't
burn down
Holds your critical data
If the company will fold if this data is lost, it's in
the Persistence plane.
Backups and non-recreatable data
37. Services Plane
Commonly available services
− DNS, VPN
− Service Location (Consul? etcd? DNS?)
− Secrets Management
− User Authentication/Management (maybe)
38. ...but what about Databases?
DBs might be part of the “Service” plane
You might consider DBs part of the application
instead
You probably don't want to see them as
“Persistence” unless you're betting your
company that they'll never go down
39. Application Plane
This is the part that touches the users!
− Application Servers
− Micro-services
− Web farm
− DBs (maybe)
Modifying Application plane often modifies state
in the other planes
− e.g. DNS, consul, ...
40. Control Plane
Global Procedures
− Backups
− Batch Jobs
− State changes
− Infrastructure Roll-out
− Recovery
Often your cloud system provides part of this
implicitly
“Scale” is often set here – ability to scale is not
42. Must
Have a separate, isolated, no-risk infrastructure
development area
Have a way to verify functionality
Use a source control tool
43. Must Not
Develop in Production!
− Yeah, yeah, everyone
knows this
− but “Can you just make
this one little change...”
Note: If your process is good, you can make “one little change” in the process, not work around the process.
44. Should
Have an automated (not necessarily automatic)
IaC roll-out process
Frequently “set up from scratch”
45. Should Not
Develop “near” production
− You don't want accidents
to impact production –
build safeties!
Have manual steps
− A little bit of friction has impact FAR out of proportion to its size
Did you forget it?
Did you miss a step?
46. Write the way you think
We think about infrastructure as objects
− “There is a network”
− “Here is a server”
− “The load balancer has these instances”
We don't think about infrastructure as
procedures
So...our tools should work the same way
47. Dealing with Component Dependencies
“Just make sure the DB is up before the app.”
“Oh, and auth needs to be up before the web
farm”
“Right, logging needs to be up before everything...”
This way lies madness (and fragility!)
− Failures cascade
− Recovery doesn't
48. Cascading Recovery
Health Checks
Retry and Wait
− If it's not up, keep trying
This gives you...
− Scalability!
− Recovery time!
− Self Healing!
Why? Because very few people are at their
best at 3am
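A minimal sketch of "retry and wait", assuming the health check is any callable that returns true when the dependency is up. The function name and its parameters are illustrative.

```python
import time


def wait_until_healthy(check, attempts=10, delay=1.0, sleep=time.sleep):
    """Poll a health check until it passes or attempts run out."""
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"still unhealthy after {attempts} attempts")


# Hypothetical dependency that comes up on the third poll.
polls = iter([False, False, True])
attempts_needed = wait_until_healthy(lambda: next(polls), delay=0)
```

A component that does this on startup, and again whenever it loses a dependency, recovers on its own once the dependency returns.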
49. What if we don't have this already?
Wrap the component and make it
Don't be afraid of building better pieces out of
what developers give you
We want to make recovery cascade
Incidentally, this makes deployment a lot
easier...and faster
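One possible shape for such a wrapper, as a sketch: it holds the real component back until its dependencies report healthy, then starts it. The class `Wrapped` and its `tick` method are hypothetical, and "healthy" here just means "started".

```python
class Wrapped:
    """Wraps a component so it waits for its dependencies instead of crashing."""

    def __init__(self, name, start, depends_on=()):
        self.name = name
        self.start = start            # callable that launches the real component
        self.depends_on = depends_on  # other Wrapped components
        self.running = False

    def healthy(self):
        return self.running

    def tick(self):
        """One pass of the supervision loop; call it repeatedly."""
        if self.running:
            return True
        if not all(dep.healthy() for dep in self.depends_on):
            return False  # not yet: wait and try again on the next pass
        self.start()
        self.running = True
        return True


started = []
db = Wrapped("db", lambda: started.append("db"))
app = Wrapped("app", lambda: started.append("app"), depends_on=(db,))

# Start order is deliberately "wrong"; retrying sorts it out.
for _ in range(2):
    for component in (app, db):
        component.tick()
```

Nobody had to say "DB before app". Each piece only knows what it needs, which is why deployment gets easier too.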
50. Service Discovery
Keep Service Discovery as local as possible
− Use cluster-local discovery instead of e.g.
global DNS
− Consul, Etcd, Private DNS zones
Keep service names as unchanged as possible
− Set the context of the service, don't configure
the pieces
Bind external services as late as possible
− Map the “www...” name in at the end
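A sketch of "set the context, don't configure the pieces", assuming a hypothetical per-environment registry standing in for Consul, etcd or a private DNS zone. The addresses are invented.

```python
def resolve(service, context, registries):
    """Look up a stable service name in the registry for the given context."""
    return registries[context][service]


# Hypothetical cluster-local registries; addresses are made up.
registries = {
    "dev": {"db": "10.0.1.5:5432", "auth": "10.0.1.9:8443"},
    "prod": {"db": "10.8.1.5:5432", "auth": "10.8.1.9:8443"},
}


def app_config(context):
    # The application always asks for "db" and "auth"; only the context changes.
    return {name: resolve(name, context, registries) for name in ("db", "auth")}


print(app_config("dev"))
print(app_config("prod"))
```

The service names never change between environments, so the application carries no per-environment configuration of its own.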
51. Writing the Code
Use inspection instead of variable passing
between planes
Use variable passing instead of hard-coding
Comment your code!
− The code shows “how”, you need to comment
on “why”
Use good commit messages!
− Make it easy to find when and where changes
are made
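A sketch of inspection between planes, assuming a hypothetical tagged inventory of what the network plane has already built. The application plane finds its subnet by looking, not by being handed an ID.

```python
# Hypothetical inventory of what the network plane has already built.
deployed = [
    {"id": "subnet-a1", "plane": "network", "tags": {"env": "dev", "role": "app-subnet"}},
    {"id": "subnet-b2", "plane": "network", "tags": {"env": "prod", "role": "app-subnet"}},
]


def find_one(plane, **tags):
    """Inspect what exists and return the single resource matching the tags."""
    matches = [
        r for r in deployed
        if r["plane"] == plane and all(r["tags"].get(k) == v for k, v in tags.items())
    ]
    if len(matches) != 1:
        raise LookupError(f"expected 1 match for {tags}, found {len(matches)}")
    return matches[0]


def app_server_spec(env):
    # env is passed as a variable (not hard-coded); the subnet is discovered.
    subnet = find_one("network", env=env, role="app-subnet")
    return {"type": "vm", "subnet": subnet["id"]}


print(app_server_spec("dev"))
```

Failing loudly when the lookup finds zero or several matches is a design choice here: a wrong guess about another plane is worse than a stopped roll-out.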
52. Rolling Out
Roll out changes to lower-risk environments first
− Dev Environment
− QA Environment
− Staging
− Prod
Wow, that's a lot of environments to manage
But you have IaC, so it's easy!
53. More Roll-out
Have some Canaries if possible
− Put a little traffic on the new system
− Be ready to take traffic off
− Often not possible
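A toy sketch of canary routing, assuming a single weight between 0.0 and 1.0 decides how much traffic the new system sees. Real load balancers expose this differently; the function name and the injectable `rng` are illustrative.

```python
import random


def pick_backend(canary_weight, rng=random.random):
    """Send roughly canary_weight (0.0 to 1.0) of requests to the new system."""
    return "new" if rng() < canary_weight else "old"


# Put a little traffic on the new system...
sample = [pick_backend(0.05) for _ in range(1000)]
# ...and be ready to take it off again by setting the weight to zero.
drained = [pick_backend(0.0) for _ in range(1000)]
```

Taking traffic off is the same operation as putting it on, just with a different number.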
54. Roll back vs Roll Forward
Either works, but make a decision and stick
with it
If you're rolling forward, make sure you can roll
forward to a previous (working) state
Roll forward is easier (and faster) for
development
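A sketch of "roll forward to a previous state", assuming releases are kept as an append-only history of configurations. The version numbers are invented.

```python
releases = []  # append-only history; the last entry is what's live


def deploy(config):
    releases.append(dict(config))
    return len(releases)  # release number


def roll_forward_to(release_number):
    """Re-deploy an earlier known-good config as a brand-new release."""
    return deploy(releases[release_number - 1])


deploy({"app_version": "1.4", "instances": 4})  # release 1, known good
deploy({"app_version": "1.5", "instances": 4})  # release 2, misbehaves
roll_forward_to(1)                              # release 3, same config as 1
```

History only ever grows, so there is one deployment path to test and the bad release stays on record.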
55. The Control Plane
What you use to control the behavior of the
other planes.
Execute and control backups
Add/Remove Users, etc.
Run any batch jobs you need (data purge?)
56. A note about scheduled jobs...
Cron is Evil
− Well, not actually
evil, but...
− Hard to monitor
− Hard to view results
− Hard to modify
− Requires sysadmin knowledge to change
Much better to have a single location with a UI
57. Run-books
Sufficiently detailed documentation is
executable
− Anything you do regularly should be scripted
− Failure recovery should be as automated as
possible
because downtime is bad
and thinking under pressure is harder
So, what's left is troubleshooting and problems
you don't yet know or understand
...which are difficult to Run-book
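A sketch of a run-book written as executable steps, assuming each step is a description plus a callable that reports success. The recovery scenario and the `state` dictionary are hypothetical.

```python
def run_book(steps, log=print):
    """Execute documented steps in order, stopping at the first failure."""
    for description, action in steps:
        log(f"STEP: {description}")
        if not action():
            log(f"FAILED: {description}")
            return False
    return True


# Hypothetical failure: a full disk has taken the service down.
state = {"disk_full": True, "service_up": False}


def purge_old_logs():
    state["disk_full"] = False
    return True


def restart_service():
    if state["disk_full"]:
        return False
    state["service_up"] = True
    return True


steps = [
    ("Purge old logs to free disk space", purge_old_logs),
    ("Restart the service", restart_service),
    ("Verify the service is up", lambda: state["service_up"]),
]
recovered = run_book(steps)
```

The descriptions are the documentation and the callables are the procedure, so the two cannot drift apart.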
58. Tying it up
Describe your IaC as declaratively as possible
Develop your infrastructure in a separate
location
Organize your IaC into planes of responsibility
Build your IaC out of modules
Deploy your changes across environments
Automate all of your normal operations
“Is” is the 1st question we ask; only after we've established “is” do we proceed to “does”.
Semantically meaningful: what you talk about when describing the architecture.
“The app server”, “the cache server”
Not “a Linux server that runs NGINX and Jetty to serve a Java web application”