At least from where I look at the community, I mainly see ops and very little dev when it comes to implementing best practices. If you’re going to treat your infrastructure as code, make sure to treat that code like your developers treat their code (and maybe even better ;)). Developers have a lot of great practices (and of course some bad ones) that an ops team can pull into their workflow to improve their own coding standards. Here I’ll talk about how Twitter’s ops teams have pulled in practices and tools that our developers use to make our code better.
This should probably go without saying, but all of your puppet code should be in some form of version control.
Currently Twitter is using SVN, possibly looking to move to git in the future.
Standardizing commit messages can also help here. We require commit messages to mention who gave the code a “ship it” and what the Review Board (RB) number was that the ship it was in.
SVN can be a good choice if you might need to drop some large files into your code base. Be careful with this though. If you’re dropping RPMs or debs onto a system, you’d probably be better off just setting up a package repo. They aren’t that hard to put together.
Branches also allow for an easier way of dealing with testing and pushing code into production. I’ll get into how Twitter handles this a little later.
Peer review of code can be very powerful. I know that I’m not the best person at Twitter when it comes to code in general, or even Puppet. But with peer review, we have the ability to easily pull in expertise before our code goes out and into production. Plus, everybody makes stupid little mistakes that are easy to miss after working on a set of modules for an hour but that someone with a fresh set of eyes might see.
The peer review tool we use at Twitter is Review Board. It’s the same tool that our developers use for their code reviews on the rest of the code at Twitter. Using a system like this, you’ll also get an easy place to look up the history of a change. We place the RB number in the commit message, so if someone is looking through the commit log they can easily see what was added in each commit plus who shipped it (though this is supposed to already be in the commit message).
Reviews aren’t hard to set up:
- Make your changes.
- Create a diff of your changes.
- Create a new review in Review Board. Upload your diff, add some people and groups as reviewers, maybe put in a JIRA ticket with the bug you’re fixing or the feature being added, and click publish.
- People added as reviewers will get an email about the RB they need to check out.
- Reviewers will review the code you’ve uploaded and make comments, and when it looks good you’ll get a “ship it”.
- Once you have a “ship it”, push your changes into the branch you were moving your code into.
- Profit.
Have a style guide and make it useful. We had a style guide created by a group during a hackweek last year. PuppetLabs has a great style guide that you can use verbatim if it works for you. Or you can create an internal doc or wiki that takes what you like from the PuppetLabs style guide and adds your own tweaks to it.
If you create a style guide, try to get as much buy-in as you can. If you use something like Review Board, try to get people to enforce the style guide (or at least the majority of it) before giving a “ship it”. Style guides only work if people are willing to work with them, but they can make your code much easier to read and work with for future engineers.
*NOTE*: This isn’t ownership with an iron fist that locks everyone else out!
We have files in our source control system that allow us to designate who owns certain modules. Each one is just a simple list of usernames for the users who happen to know a set of modules. Owners simply know this code better than most of the other people who might be working in those modules.
These files can be used either for authorization or purely for information.
Where this can be helpful is if you’re making a change across a large set of modules. Let’s say you’re working on upgrading to Puppet 3 and you want to make sure that every “provider => puppet:///<module>/<filename>” is changed to “provider => puppet:///modules/<module>/<filename>”.
You might have one engineer who is working on an upgrade from Puppet 2.7 to Puppet 3.x. There are some formatting changes that need to happen for this, and one example is updating the path for a provider when puppet is being used. In this example, the engineer updates the code and sets up an RB using the OWNERS files for each module so that all of the known parties are notified. Another email could also be sent out about the RB, saying that it’ll be left open for a set amount of time (a couple of days to a week) and then it will be pushed up and into production. This makes sure that you let everyone know about the change, whereas with just an email you might miss a list that contains a group, and then no one on that team sees the changes and something breaks for them on the merge.
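As a rough sketch of how OWNERS files could feed a reviewer list, the Python below collects the owners of every module a change touches. The file layout (paths shaped like modules/<module>/...), the one-username-per-line format with `#` comments, and all usernames are assumptions for illustration; the real OWNERS format is not described here.

```python
from pathlib import PurePosixPath


def parse_owners(text):
    """Return usernames from an OWNERS file body, skipping blanks and comments."""
    owners = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            owners.append(name)
    return owners


def modules_touched(changed_paths):
    """Return the set of module names for paths shaped like modules/<module>/..."""
    touched = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        if len(parts) >= 2 and parts[0] == "modules":
            touched.add(parts[1])
    return touched


def reviewers_for(changed_paths, owners_by_module):
    """Collect the owners of every touched module, sorted and de-duplicated."""
    reviewers = set()
    for module in modules_touched(changed_paths):
        reviewers.update(owners_by_module.get(module, []))
    return sorted(reviewers)


# Hypothetical OWNERS contents and changed files.
owners_by_module = {
    "jvm": parse_owners("alice\nbob  # on call\n"),
    "monit": parse_owners("# monit owners\ncarol\n"),
}
changed = ["modules/jvm/manifests/init.pp", "modules/monit/files/monitrc", "README"]
print(reviewers_for(changed, owners_by_module))  # ['alice', 'bob', 'carol']
```

The resulting list is what you would hand to whatever tool creates the review, so nobody who owns a touched module is left off.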
Of course all of this can be automated and it should be because it’ll allow for more and more of your engineers to follow the best practices.
Git review tools allow you to automatically create the diff of your changes and upload it into RB. They can also read the OWNERS files and automatically add the owners to the RB for you.
puppet-lint can check the style of code and can be run by all of your users. You can even check the code that is being committed as a hook in your version control system. If the check fails, don’t commit, and let the user know that they need to fix their style. If the style passes, continue on.
OWNERS can be read and uploaded with git review tools. If you want a more authoritative use of them, you can add a git hook that checks RB for ship its and makes sure that somebody in the OWNERS file for a module gave a ship it before the change is pushed to production systems.
Script anything that you need to do, if possible. Version control hooks can be great for this and help improve your code quality. You can also script a basic regex that checks for certain things in a commit message to make sure you get all the information you need.
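A commit message check of that kind could look like the sketch below. The trailer format (`RB: <number>` and `Ship-it: <username>` on their own lines) is a made-up convention for illustration, not the format Twitter actually uses; a real hook would call this function and reject the commit when the returned list is non-empty.

```python
import re

# Hypothetical trailer format: "RB: 12345" and "Ship-it: alice" on their own lines.
RB_LINE = re.compile(r"^RB:[ \t]*(\d+)[ \t]*$", re.MULTILINE)
SHIP_IT_LINE = re.compile(r"^Ship-it:[ \t]*(\S+)[ \t]*$", re.MULTILINE)


def check_commit_message(message):
    """Return a list of problems; an empty list means the message passes."""
    problems = []
    if not RB_LINE.search(message):
        problems.append("missing 'RB: <number>' line")
    if not SHIP_IT_LINE.search(message):
        problems.append("missing 'Ship-it: <username>' line")
    return problems


good = "Fix monit config path\n\nRB: 12345\nShip-it: alice\n"
bad = "Fix monit config path\n"
print(check_commit_message(good))  # []
print(check_commit_message(bad))
```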
Figure out what you’re doing well and what isn’t working. If something isn’t working for you and your team, don’t keep doing it. Figure out what is not working, and either fix it or remove it. Don’t just do something because you “should”; do something because it’s what works.
Also look at what’s missing from your workflow. For example, at Twitter we want to work on getting our puppet code tested. We have a side project that an engineer is working on that will work with rspec-puppet. We’re hoping that we can get some coverage on our code and make it even better. This can even be added as a CI job that runs before your code is committed; if it passes, your code is committed to the production branch.
Because going directly to production is not always a smart idea.
With our usage of SVN we don’t really have an easy way to branch a lot, so we’re currently using three branches. We’ve named them head (which is the master branch), puppet-testing and puppet-production. With this branch system and how we run the puppet agent (discussed later), we’re able to effectively work on our infrastructure without screwing everyone else up.
Our head branch is where we do our development. This is the master branch of the repository, so it’s the easiest branch to commit to. This branch can be rather volatile, so we try not to leave hosts on this branch. With our puppet-util tool and ENC it’s easy to move hosts between branches.
Normal workflow:
- Make your changes.
- Commit into head.
- Put a machine on head in the role you’re working on and test out puppet.
- If everything is good, continue on to pulling the changes into the puppet-testing and puppet-production branches.
- Otherwise, go back to the first step.
Our puppet-testing branch goes along with our canary system for deployments. This branch is used to allow for long-term testing in a production setting. We usually have a small subset of the hosts in a role set to this branch to make sure that any changes can be tested and debugged across the fleet before making it to production and affecting all hosts.
Allowing major changes to sit for days to weeks on a subset of systems really allows us to make sure that what we’re sending to production is working properly. This is usually done for large DNS changes, JVM updates and kernel updates. Our kernel update process involves the OS team updating the kernel puppet module with the latest canary kernel, which is automatically installed on systems using this branch. Once the systems are rebooted they pick up the new changes.
Usual workflow:
- Changes made to head
- Find commit ID for head commit
- Use cherry-pick.sh utility to pull changes into testing branch
- Get a ship it from somebody else who knows the code
- Commit the changes into the repo for the branch
Promotion into the puppet-production branch follows the same steps as promotion into puppet-testing.
Usual workflow:
- Changes made to head
- Find commit ID for head commit
- Use cherry-pick.sh utility to pull changes into production branch
- Get a ship it from somebody else who knows the code
- Commit the changes into the repo for the branch
Here is the basic workflow for our commits to our puppet repo. You start out working in head, where you develop, commit your changes and then test. If your changes are working as expected in your testing, you can begin promoting up into the other branches; otherwise you keep repeating the three steps until you have what you need. Cherry picking is done by a script that takes a commit ID and a puppet branch. The script updates all of the files in the commit that you’ve referenced. From this you use a special commit to create a Review Board entry. You get a “ship it” or two from a couple of other developers and then commit the changes into the repo for the branch.
This setup can be a lot for a new user to take in at first. But once it’s been done a few times it becomes quite easy.
Sometimes you can have a few commits to head before you get it right, and if you don’t document those commits and get pulled into something else, it can become difficult to find all of your changes that need to be cherry picked.
Commits can get orphaned in head. We try to remedy this with an email that sends out diffs between head -> testing and testing -> production. Unfortunately, these emails aren’t usually acted upon, so these changes can sit in one or two branches but not all three, which can cause some confusion in the future.
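To make the orphaned-commit problem concrete, here is a minimal sketch that reports which changes have not yet been promoted at each hop. It assumes each promoted change can be traced back to the ID of the head commit it came from; the commit IDs are invented, and the real email works from diffs rather than ID lists.

```python
def orphaned(source_commits, promoted_ids):
    """Return commits from source_commits, in order, that have not been promoted."""
    promoted = set(promoted_ids)
    return [commit for commit in source_commits if commit not in promoted]


# Hypothetical commit IDs, keyed by the originating head commit.
head = ["r101", "r102", "r103", "r104"]
testing = ["r101", "r103"]
production = ["r101"]

report = {
    "head -> testing": orphaned(head, testing),
    "testing -> production": orphaned(testing, production),
}
for hop, commits in report.items():
    print(f"{hop}: {', '.join(commits) or 'nothing pending'}")
```

A per-author version of the same report might get acted on more often than a single diff sent to everyone.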
Since we’re using 3 different branches, we need a way for hosts to differentiate which code set they get. So, we run three puppetmasterd processes per host on 3 different ports. Each port is scaled differently depending on how much traffic it should expect to get. The port hosting our production code will usually run about 26 Passenger processes; the canary and head branches, due to their lower traffic, run about 4 Passenger processes each.
We don’t run the puppet agent daemon but use a cronjob for clients. The cronjob runs our puppet-util tool, which will be discussed in depth later and which uses round-robin DNS to determine which puppet master to access.
Since we have multiple puppet masters, certificates can become an issue. So, we have autosigning turned on, and when the puppet-util script accesses the puppet master it asks for a certificate no matter what.
Audubon is an internal tool that was created for keeping inventory of hosts and roles at Twitter. At its most fundamental it’s a REST API that returns JSON blobs with information on a role, a host, an attribute, a fact or a group that you specify. It will also take wildcards, so if you’re not sure exactly what you’re looking for you can still find it. It’s also all command line driven, which I find nice, but some might prefer a web GUI for this information as well.
Audubon has three basic constructs:
- Facts
- Attributes
- Groups
Since Audubon was being built to house all of our information on hosts and roles, it became a perfect fit for our ENC.
Facts have mainly been deprecated now. We have a script, run as an init script when the host boots, that refreshes the majority of this information. This data really doesn’t change, so refreshing it on reboot is enough and we don’t have to worry about grabbing the information during a puppet run.
Attributes are the most used and useful of the three. Attributes are anything that describes a host or role, so they can include things such as ownership, access groups, information about services running on the host, and information about the host and its hardware.
Attributes allow for settings to be made at either the server or the role level. At the server level, this host is the only one that cares about the information and it’s unique to this host. We try to avoid these types of attributes for anything that describes the service, but they are useful for things like IP addresses, MAC addresses and other information like that. At the role level, the information is relevant across the entire role and can include information about the service used upon setup or instantiation.
Attributes set at the role level are inherited, so as your role gets new machines you won’t need to update these attributes. Anything set as a server attribute, though, will need to be updated.
There is also a hierarchy where server level attributes have priority over role level attributes and will be returned first.
One example of a role level attribute is our kernel or jdk_version attributes. Since we have the canary setup described earlier, we can set server level attributes to use the canaried versions of these, but across the roles we use the kernel and JDK versions that have been tested over the past month to 6 weeks.
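The server-over-role precedence can be sketched in a few lines of Python. This is only an illustration of the lookup order, not Audubon’s implementation; the attribute values and addresses are placeholders.

```python
from collections import ChainMap


def resolve_attribute(name, server_attrs, role_attrs):
    """Return the server level value if set, else the role level value, else None."""
    # ChainMap searches its mappings left to right, so server level wins.
    return ChainMap(server_attrs, role_attrs).get(name)


# Placeholder values; real version strings would go here.
role_attrs = {"jdk_version": "jdk-stable", "kernel": "kernel-stable"}
canary_host = {"jdk_version": "jdk-canary", "ip_address": "10.0.0.12"}
regular_host = {"ip_address": "10.0.0.13"}

print(resolve_attribute("jdk_version", canary_host, role_attrs))   # jdk-canary
print(resolve_attribute("jdk_version", regular_host, role_attrs))  # jdk-stable
print(resolve_attribute("kernel", canary_host, role_attrs))        # kernel-stable
```

A new host added to the role starts with no server level attributes, so it picks up every role level value without anyone touching it.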
Groups can be used like attributes, but they are strictly server level and are not inherited. So, you might want to use a group for something like cluster_name, so that when a new host is added to the role it is almost completely set up but does not join the rest of the cluster until you tell it to. That way you can watch it and make sure it’s working properly first.
Our ENC allows us to easily pull in values for whatever host the puppet agent is being run on. This allows us to easily make our manifests dynamic based on the information pulled from the ENC when doing the initial lookup for the role of the host.
So both puppet masters and puppet clients use our ENC so that we have one central location for all the information that we need. This makes it convenient for finding the information you need about a host and what should be happening in a puppet manifest if something goes wrong.
With our three branch setup, we can’t use the puppet daemon. So, we instead use a cronjob to run the puppet agent every thirty minutes, at a minute offset that is derived from the hostname of the host. This helps us somewhat in mitigating the possibility of a thundering herd hitting the puppet masters. With our setup, there is a set of steps that each host must go through when contacting the puppet masters.
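One way to get a stable, per-host minute is to hash the hostname, as in the sketch below. How the minute is actually derived is not described here, so the hashing scheme, the example hostname and the `puppet-util run` command name are all assumptions.

```python
import hashlib


def cron_minutes(hostname, interval=30):
    """Return the minutes of the hour at which this host should run puppet."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # The same hostname always hashes to the same offset within the interval.
    offset = int.from_bytes(digest[:4], "big") % interval
    return list(range(offset, 60, interval))


def cron_line(hostname, command="puppet-util run"):
    """Build a crontab entry; the command name is a hypothetical placeholder."""
    minutes = ",".join(str(minute) for minute in cron_minutes(hostname))
    return f"{minutes} * * * * {command}"


print(cron_line("web001.example.com"))
```

Because the offset depends only on the hostname, a host keeps its slot across reinstalls while the fleet as a whole spreads across all thirty minutes.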
The basic process for how puppet-util runs:
- First we check to make sure that we have loony.
- After that, get the branch that the host is using from the puppet_branch group in Audubon.
  - If the branch name is head, canary, testing, prod or production then we’ll continue.
  - If the branch name is off or something that we don’t recognize then we’ll kill the puppet run and let the user know.
- Take the branch name and figure out which port to use against the puppet master.
- Get a puppet master from a well-known DNS name that is set up for round-robin. After getting the name, do some name-to-IP matching to make sure that we got back a good name.
- Next, use the name and the port to download the CA cert for the puppet master.
  - If this is successful then we’ll continue on down the line.
  - If this is unsuccessful then we’ll retry up to 5 times before quitting the puppet run.
- Finally, we’ll run the puppet agent command against the puppet master and port combo that we’ve found, download our catalog and apply it to the host (unless dryrun has been specified).
This tool doesn’t only do the puppet runs, though. It’s also our general puppet tool for doing tasks such as setting the branch, turning off puppet runs, doing dry runs and even some basic checking for puppet manifests that are constantly updating.
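The control flow of a run can be sketched as follows. This is not puppet-util itself: the port numbers, the treatment of canary/testing and prod/production as aliases, and the `pick_master` and `fetch_ca_cert` callables are all assumptions, and the loony check and name-to-IP validation are left out. The agent flags shown are standard `puppet agent` options from the Puppet 2.7/3 era.

```python
import time

# Hypothetical ports; all that is known is that each branch has its own port.
BRANCH_PORTS = {
    "head": 8141,
    "canary": 8142,
    "testing": 8142,
    "prod": 8140,
    "production": 8140,
}


class PuppetRunAborted(Exception):
    """Raised when the run should stop before applying a catalog."""


def port_for_branch(branch):
    """Map a branch name to its master port, refusing 'off' or unknown names."""
    try:
        return BRANCH_PORTS[branch]
    except KeyError:
        raise PuppetRunAborted(f"puppet is off or branch is unknown: {branch!r}") from None


def fetch_with_retries(fetch_ca_cert, master, port, attempts=5, delay=0.0):
    """Call fetch_ca_cert(master, port) until it succeeds, up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        if fetch_ca_cert(master, port):
            return attempt
        time.sleep(delay)
    raise PuppetRunAborted(f"could not fetch CA cert from {master}:{port}")


def plan_run(branch, pick_master, fetch_ca_cert, dry_run=False):
    """Return the puppet agent command for this host as a list of arguments."""
    port = port_for_branch(branch)
    master = pick_master()  # e.g. resolve the round-robin DNS name
    fetch_with_retries(fetch_ca_cert, master, port)
    command = ["puppet", "agent", "--onetime", "--no-daemonize",
               "--server", master, "--masterport", str(port)]
    if dry_run:
        command.append("--noop")
    return command


command = plan_run(
    "testing",
    pick_master=lambda: "puppetmaster1.example.com",
    fetch_ca_cert=lambda master, port: True,
    dry_run=True,
)
print(" ".join(command))
```

Passing the master lookup and certificate fetch in as callables keeps the decision logic testable without touching DNS or the network.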
It’s a good idea to think of modules as a pyramid. You can subdivide the pyramid as much as you want, but three basic categories should suffice for most teams, and that’s basically what we do at Twitter.
At the lowest level we have our base modules. The base modules are those that are imported by every system by default. These can include kernels, LDAP/Kerberos, YUM/APT repos, and any tools or packages that should be installed across the board. These are the modules that make your hosts production ready, so that they just need the extras from the service to operate fully.
Next you’ll tend to have team level modules. These can be modules that include a set of tools, packages or tasks that your team wants across all of the hosts you run. This can include things such as a custom MOTD, cronjobs that check various aspects of the systems, etc.
Finally we have service modules. These are the modules that set up the services on top of everything that was defined in the lower two levels. These modules can include things such as setting up monit, creating the directory structure that your application is expecting, and setting up logrotated confs for your service. Really this is anything that needs to happen and is specific to your service.
When you build a module for a system service, you gain the possibility of using it in more than one place. If you recreate your configs in a few different modules then you run the risk of missing something when a change occurs. Create abstract modules for things that most of your teams will use, such as logrotated, monit, and common release directories for the services you deploy (/usr/local/<service>, /var/log/<service>, /var/run/<service>, etc). If you have abstracted modules then everyone who is doing the same thing should be able to use your module to set up their services.
E.g. JVM module:
- We’re moving from our Ruby stack to a bunch of JVM services.
- Each service needs pretty much the same setup:
  - Files created such as /usr/local/<service>, /var/log/<service>, /var/run/<parent>/<service>
  - Configs for supporting services such as monit, logrotated, JVM metrics
- Creating a single module for this has allowed teams to have a common setup, so it’s easy for others to figure out what is going on if needed.
The default settings for the module should give you a working service. If a team needs to tweak it, you should allow it.
At times developers are going to request software such as DBs to test out changes on. Make sure that it’s easy for them to set these up and work with them. For example, we have a great DBA team, but sometimes our devs want to try something in a DB in a non-production setting. So, our DBA team has created a module that will set up a MySQL DB, configured the same way the DBAs configure theirs, on any machine you set the attribute on. This way, the developers can set up a DB to play with in less than 5 minutes and 1 puppet run, which makes them quite happy.
Over the coming months, what are we looking to do with our puppet infrastructure? Our biggest goal will be to move towards Puppet 3.x. We have some updating of syntax and some canarying to do, but this has been in progress for a few months and is a Q1 goal for completion. Perhaps by PuppetConf I’ll have an update on this front and possibly a few new sections on what to look for on a puppet upgrade (but haven’t most people upgraded?).
We also have the possibility of moving away from SVN and to git. Git is the version control system that we use for almost all of our other repos. We also usually don’t keep large files in our puppet repo, so git would be an interesting change. Using git could also allow us to more easily test out feature branches on systems without possibly having to make multiple commits into head like we do today.
1. Find branch the host is using
2. Find the port that branch uses on masters
3. Get a master from DNS
   1. Health check the master
   2. If healthy, continue; else repeat until timeout
4. Run the puppet agent against the master and port