Watch this talk here: https://www.youtube.com/watch?v=L__8o02od6Q
For an example of the code we used in our CI pipeline to make a Chef Environment from a Berksfile.lock - check out this project:
https://github.com/petecheslock/berks2env
One of the biggest advantages of Chef is its flexibility, allowing you to customize it at will to fit your infrastructure needs. While this makes Chef incredibly powerful, it can also be challenging to develop a workflow to manage the day-to-day usage of Chef.
Should I use a single repo for all my cookbooks?
One cookbook per repo?
Berkshelf?
Librarian?
Test-Kitchen?
Where does Jenkins (CI) fit in?
What about Testing?
How does this work with my small team? What about my large team? What about my distributed team?
Over the past few years I have been a part of two distinct Chef workflows that take opposite paths about how to solve issues around collaboration, versioning, testing, etc. During the course of this talk I will share:
Details about the requirements that led us down these two paths.
What worked.
What didn't.
How we use many of the tools available to safely test code changes.
How we deploy cookbook changes safely and quickly (and keep uptime our highest priority).
4. Who Am I?
Pete Cheslock
Currently - Rabble Rouser at Dyn
Previously at Sonian - one of the very early Opscode Chef™
Customers (probably?). Also Sensu.
6. Disclaimer
WARNING: THIS TALK FEATURES TWO CRAZY ASS WAYS
YOU CAN USE CHEF AND IS INTENDED FOR A MATURE
AUDIENCE. PETE CHESLOCK DOES NOT CONDONE THE
WORKFLOWS USED AND DISCOURAGES ANYONE FROM
ATTEMPTING THEM.
THIS TALK MAY ANGER YOU - I’M HERE IF YOU NEED
A HUG AFTERWARDS
13. Environments
Databags
Roles are good
Roles are bad
WTF is a Berkshelf?
Librarian?
Chef Server
Chef Zero
Vagrant-Berkswhat?
Hosted Chef
LWRPs
Don’t Use Definitions!
Definitions are Awesome!
15. Sonian
Founded 2008
2008 AWS Startup Challenge Finalist
I joined in 2009
Very early Chef user - Originally with
Puppet (before Opscode existed)
Pre-Databags, Roles, etc, etc.
Massive growth in short time - reaching
100’s of TB’s of ElasticSearch and well
over a PB of S3 Storage.
20. Soon - business started to pick
up - very quickly.
Speed picked up, things moved
fast and we broke stuff
To close some deals we had
contracts signed that would
limit when we could push
changes to the systems.
24. Now imagine that scenario with 20 environments - Each
environment living either on AWS, Rackspace Cloud, HP Cloud or
IBM “SmartCloud”
Each environment has a different contracted deployment schedule.
I know what you are thinking - system changes aren’t a “deploy” -
well next time I’ll bring you to meet with the lawyers on that.
25. How did this work in practice?
In the past we’d push a small change to Prod - everything would
break terribly. Lots of technical debt - scenarios that no one could
ever believe could happen
This is email archiving - in some cases customers would have mail
forwarded to us via their mail server. We CAN NOT drop that
mail. If they are audited and we are proven to be missing data -
that is really, really bad. Srs super bad.
26. We liked our single Chef-repo
Every story had a branch - and we got into the cycle of commit,
merge, push and test
Represented our pre-prod environments as branches in git - using
some internal tooling to manage.
27. Branches: eng-9999, HEAD (master), QA (Daily), Dev (Daily)
Cut a new branch from master
Developer adds commits and tests locally
Developer merges to dev branch for dev testing
If things “work” and nothing breaks - merge to QA
If it passes regression testing - merge into master (with others)
28. • roles/stack.rb
• base.rb
• nonprod.rb
• cloud.rb (ec2, rackspace)
• roles/application.rb
• application.rb
• service.rb
• etc.rb
“Hold on a minute. I’m
just going to push this small
change to this one role.”
It’s roles all the way down
29. We got burned all the time.
“Move Fast and Break Everything”
Needed something that worked for
today & the future
Let’s create a Git branching
strategy!
30. Wut?
I know.
Seriously. I know.
We were trying to answer this one question.
“How do you version the cookbooks, roles,
and databags as one singular asset.”
38. That sounds overly complex
We had some Git experts - and it
leveled up all our game.
Extensive tooling around our
branching strategy.
We were Release Engineering.
41. So What Happened?
It actually worked.
Not only that - it really worked well.
20+ Stacks, upgrading 4 per night (6pm to 12am if you are lucky)
Before “Deploy Week” - we deployed all the time - and things broke
all the time.
42. Over the course of about 12 months we went from:
Deploy whenever - things break randomly (little testing)
Create a multi-page deploy checklist of mostly manual items
“Deploy Week” - 20 Stacks over 5 days (6pm to 12am - hopefully)
“Deploy Day” - 20 Stacks over one night - 6pm to 9pm
“Deploy Day” - Saturday (contracts) - Best time was 20+ stacks ~1 hour
43. Deploys were drama free
They were drama free because we tested all the pieces that changed
together. And not just unit and integration testing, but full on
regression testing and user acceptance testing.
DataBags, Roles, Cookbooks, Application Code - It all moved together.
Tooling was built to support the support team (who eventually did the
deploys)
High communication and tight teamwork allowed this to work.
45. “If I could do it all over again I would do it very differently”
46. Dyn
Incorporated in 2001, Dyn’s
global presence services more than
four million enterprise, small
business and personal customers.
We specialize in Traffic Management
& Message Management
I joined early in 2013 to run the
System Automation and Release
Engineering Team
(We call it DevTools)
53. Develop a pipeline that allows
for simple usage by plugging it
into a CI system for
automated testing and
deployment.
54. But the hardest challenge is that change is dangerous. It’s even
more frightening when you have a MASSIVE chunk of the internet
depending on you to stay running ALL THE TIME.
55. Do it w/o taking down the internet
If we don’t build in the
necessary gates and levers to
allow for lots of testing and
controlled deploy options out
to our edge systems, bad
things can happen.
62. Initial Challenges
We have lots of FreeBSD
Change is hard - especially to unknown systems.
We really wanted to deploy a solution that was going to bring in
Zero Dependencies.
66. Now that FreeBSD problem is solved - we were able to start
deploying Chef out to all our nodes.
We created a role[base] - which includes a run list of the things
we wanted in place.
About a month later or so - we wanted to push a change to that role
- at the same time it was linked to some specific cookbook versions.
67. So basically we wanted a versioned
run list - but we also wanted to set and
override some attributes.
So we decided to move away from roles (since
we were not using them much yet) and
just focus on using wrapper recipes.
The bonus here is that any person can
just clone a cookbook - and run Test-
Kitchen & Serverspec on that “role”
to get a node just like it. No dealing
with roles from other cookbooks.
69. The wrapper recipe idea made sense to us because we wanted to
make sure that when we used community cookbooks - we never
edited them. So for example we have a dyn_ci recipe which wraps
the functionality inside of the Jenkins recipe.
When Jenkins updates from 1.0 to 2.0 - we simply update and
refactor our wrapper cookbook and set the version constraint in the
metadata as appropriate.
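As a rough illustration of what a version constraint in the wrapper cookbook's metadata lets through, here is a minimal sketch. It uses RubyGems' Gem::Requirement as a stand-in for Chef's own constraint matching, and the "~> 2.0" constraint and the version numbers are made up for the example - they are not taken from our actual metadata.

```ruby
require 'rubygems'

# Hypothetical constraint a wrapper cookbook such as dyn_ci might declare in
# its metadata.rb, e.g.:  depends 'jenkins', '~> 2.0'
JENKINS_CONSTRAINT = Gem::Requirement.new('~> 2.0')

# Returns true when the given cookbook version satisfies the constraint.
# Gem::Requirement is used here as a stand-in for Chef's constraint logic.
def allowed?(version, constraint = JENKINS_CONSTRAINT)
  constraint.satisfied_by?(Gem::Version.new(version))
end

%w[1.9.0 2.0.0 2.4.1 3.0.0].each do |v|
  puts "jenkins #{v}: #{allowed?(v) ? 'allowed' : 'rejected'}"
end
```

With a pessimistic constraint like this, any 2.x release of the wrapped cookbook is accepted, while the next major version is held back until the wrapper has been refactored and the constraint is changed on purpose.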
71. We use the default chef-full template and it has a section that looks
like this:
72. Where are most community cookbooks stored? github.com &
community.opscode.com. Who does their DNS? You see where
we are going.
So - we created a new organization on our Enterprise Chef Server
- called the cookbook repo, where we stored community
cookbooks we used.
73. Later we moved those to Github Enterprise locally for 2 reasons.
1. It allowed anyone to easily see which cookbooks we already had
locally.
2. It allowed us to run short-lived forks of those cookbooks while we
pushed the changes upstream to the owner (and for people to see
those changes).
74. Remove the humans from the equation
Foodcritic, chefspec, rubocop,
serverspec
thor-scmversion to automate
versioning and git tagging.
75. Run will execute - if the tests pass - thor will version
based on #patch, #minor, #major
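The hashtag-driven bump can be sketched roughly like this. This is not thor-scmversion's actual code - the method names are hypothetical, and falling back to a patch bump when no hashtag is present is an assumption of the sketch.

```ruby
# Picks the bump level from hashtags in a commit message.
# The highest-impact tag wins; defaulting to :patch is an assumption here.
def bump_level(message)
  return :major if message.include?('#major')
  return :minor if message.include?('#minor')
  :patch
end

# Returns the next "x.y.z" version string for the given bump level.
def next_version(current, level)
  major, minor, patch = current.split('.').map(&:to_i)
  case level
  when :major then "#{major + 1}.0.0"
  when :minor then "#{major}.#{minor + 1}.0"
  else "#{major}.#{minor}.#{patch + 1}"
  end
end

puts next_version('1.4.2', bump_level('Fix template typo #patch'))
puts next_version('1.4.2', bump_level('Add new recipe #minor'))
puts next_version('1.4.2', bump_level('Drop old attributes #major'))
```

The point is that the person writing the commit decides the size of the change, and the CI job turns that into a version and a tag with no human editing version numbers by hand.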
76. So we try to speed up the iteration to master
So - now the development cycle looks like
User cuts a branch - makes changes - runs tests locally (we hope) -
then submits a pull request.
Jenkins tests the PR - if good - report back to GH:E with Green.
When merged - Jenkins runs the tests again - if they pass then
Jenkins will tag the release and upload it to the cookbookrepo.
78. How has this worked?
We are the product owner
On-Demand support internally
Training
Mentoring
79. All new apps come with
cookbooks.
They even come with tests.
(Yay!)
Test Kitchen and Berkshelf
for our local development and
deploy
80. github.com/dyninc/cookbookapi
So we built our own
cookbook api to use
(with Berks 2) that let
us use our own site
with our own
cookbooks (and the
community cookbooks
in our site repo)
81. So how do you get it to production?
So - the requirements were such that we wanted a few things
Easily be able to deploy to a single node in a site
Easily be able to deploy to a single node in many sites
Easily be able to deploy to a single node in every site
Easily be able to deploy to a single node in a region
Easily be able to deploy to a single node in many sites
…… you get the point. EVERY POSSIBLE DEPLOY SCENARIO.
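To make those scenarios concrete, here is a minimal sketch of picking one node per site, optionally narrowed to a list of sites or to a region. The Node record, its field names, and the sample inventory are all hypothetical - this is not our actual tooling, just one way the selection could be expressed.

```ruby
# Hypothetical inventory record; the field names are assumptions.
Node = Struct.new(:name, :site, :region)

# Picks one node per site, optionally narrowed to a region or a list of sites.
def single_node_per_site(nodes, sites: nil, region: nil)
  candidates = nodes
  candidates = candidates.select { |n| n.region == region } if region
  candidates = candidates.select { |n| sites.include?(n.site) } if sites
  # One representative per site: the first node by name.
  candidates.group_by(&:site).map { |_site, members| members.min_by(&:name) }
end

INVENTORY = [
  Node.new('dns1.site-a', 'site-a', 'us-east'),
  Node.new('dns2.site-a', 'site-a', 'us-east'),
  Node.new('dns1.site-b', 'site-b', 'us-east'),
  Node.new('dns1.site-c', 'site-c', 'eu-west')
].freeze

puts single_node_per_site(INVENTORY, sites: ['site-a']).map(&:name).inspect
puts single_node_per_site(INVENTORY, region: 'us-east').map(&:name).inspect
puts single_node_per_site(INVENTORY).map(&:name).inspect
```

The three calls correspond to a single node in one site, a single node per site in a region, and a single node in every site.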
83. Represent state of chef org in Git
Act as single source of truth
Have Jenkins manage the upload of
those cookbooks to prod
Ensure the environment locks those
cookbooks explicitly
84. So, I already told you we didn’t use roles because we really wanted to be able
to version the run list (many people other than us could be touching that).
We have thor-scmversion auto-bumping the versions of cookbooks (and
freezing on upload to the package server). As one does.
We knew that when we ran a node in production - we wanted it in an
environment with very specific cookbook version locks.
And we wanted those environments to be immutable. Created and uploaded
in an automated way.
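Generating such an environment could look roughly like the sketch below. It takes a plain hash of cookbook name to locked version (parsing the Berksfile.lock itself is left out) and emits a Chef environment document with exact "= x.y.z" pins. The environment name and cookbook versions are invented for the example; see the berks2env project for the code we actually used.

```ruby
require 'json'

# Builds a Chef environment document that pins every cookbook to an exact
# version. `locked` maps cookbook name => version, as resolved by Berkshelf.
def environment_from_locks(name, locked)
  {
    'name' => name,
    'description' => 'Generated from locked cookbook versions',
    'json_class' => 'Chef::Environment',
    'chef_type' => 'environment',
    'default_attributes' => {},
    'override_attributes' => {},
    # "= x.y.z" is an exact pin: nodes in this environment get only that version.
    'cookbook_versions' => locked.sort.map { |cookbook, version| [cookbook, "= #{version}"] }.to_h
  }
end

# Hypothetical input; real values would come from the lock file.
env = environment_from_locks('production', 'ntp' => '1.5.0', 'dyn_ci' => '0.3.1')
puts JSON.pretty_generate(env)
```

Because the file is produced entirely from the lock data, a CI job can regenerate and upload it on every release rather than anyone editing version pins by hand.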
86. We’ve been using thor-scmversion for versioning our cookbooks - why not
our servers too?
96. Limited allow list for deploy
Anyone can propose a change to production - but the ops team will
approve those changes. (for #patch or greater that is)
The same workflow applies to pre-release environments.
97. Databags?
Since we version all of our cookbooks using Thor-scmversion
And we do the same with chef environments.
And we need lots of flexibility with our code deployment process
due to the nature of the system
We built a tool that allows us to version our databags for deploy.
https://github.com/Vanders/knife-databag-version
98. Version your databags?
Seriously - what is wrong with
you?
We use databags pretty sparingly
- mostly just encrypted databags
for shared secrets and other info.
Our engineers ask us for the
flexibility - we build the tools.
The tools enable the workflow.
100. What does this all look like?
Assume we have a simple data bag item:
104. All managed by Jenkins - hands off
for the developer
Databags the same as cookbooks -
and allow for more flexible deploy
options for us.
We still use standard databags - this is
just another lever to pull
105. Room for improvement?
#minor and #major
Site to abstract changing
cookbook versions.
Upload cookbooks early -
control with environment
version locks