Rails Operations
Lessons learned from deploying and managing hundreds of Rails applications
Thanks for coming out this morning. I know it’s hungover o’clock, so it means a lot. You are dedicated, upstanding individuals.
• @techpickles
• http://github.com/technicalpickles
• http://technicalpickles.com
I am from the internet.
Awesomeness Engineer of Supreme Versatility II
My official title is Awesomeness Engineer of Supreme Versatility II. (I was recently promoted.)
Managed hosting and operations
We’re mostly known for our hosting. What isn’t as well known is our managed services. For this, we engage more closely with our customers.
When bringing on new managed customers, we work with them to spec out servers and review the application’s needs. We get them up and running on these servers with our configuration management tool, moonshine. And once deployed, we provide 24x7 monitoring. If your server goes down, we let you know, and get it back online as soon as possible, regardless of when it happens.
And that’s not all. Once live, we provide operational support. Anything from application performance analysis, recommending architecture improvements, installing and managing new software on servers, or just being there to give feedback on how the application is operating.
You can basically think of us as a Rails Operations company.
I’m talking about Rails Operations
Conveniently enough, I’m talking about Rails Operations today.
WTF is Rails Operations?
I found this hard to distill down to a simple statement.
I think it’s safe to say that the majority of us are developers. We write code, build applications, launch products.
In a lot of organizations, operations is something different. People associate operations with system administration. And to an extent, this can be fairly accurate. Different people, different teams, different. As developers, we write some code, toss it over the wall, and let _them_ handle it.
I think this is a bit flawed. The code you write has an operational impact. The systems you run it on have an operational impact on your code. It’s a complex relationship, and when developer and operations teams are separate, it’s hard to bridge the gap between them, since it’s neither’s responsibility.
Development and maintenance of a production Rails application
The simplest definition I’ve found is this.
Very important assumption
You develop code that will eventually go into production and, due in part to some business model, generate revenue.
That is to say, you are part of some organization.
• Working features (or at least that work enough)
• Infrastructure to keep the application up and running (or at least up enough)
• A business model
• Sheer determination
• Good luck
Lessons learned
Alright. I’ve given you a definition of Rails Operations, and had a brief detour to talk about the business and where development and operations fit into it.
Now for some lessons. Basically, I’ll be going over some patterns, some antipatterns, and other practices and topics.
Common threads
Putting this all together, I kept coming back to some common threads. That is, some ideas that apply to many aspects. I’m going to start you off with a few together, and then just jump into the lessons. We’ll probably pick up a few more along the way.
Give a damn
If you don’t care about what you’re doing, everything else I’m talking about today probably doesn’t matter. I don’t think you need to worry about this though, since you are here.
Earlier we talked about how operations preserves revenue. To that end, our goal is to mitigate risk as much as makes sense.
Tradeoffs and compromise. Each possible solution has them. The trick is understanding that there are tradeoffs. What tradeoffs you make depends on what your priorities are. For example:
* Dollar signs
* Time
* Sanity
* Technical debt
* Higher risk
You write code that manages your servers’ configuration
Take a moment to think about how you might describe a server to someone.
There are plenty of nouns:
* packages
* users
* files
* cronjobs
* services
And some verbs:
* running commands
• apache package is installed
• apache service is running
• deploy user exists
• cron jobs
• etc
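The bullets above read a lot like code already. As a minimal sketch (plain Ruby, not moonshine's actual API; the `Resource` struct, the `converge` helper, and the Hash standing in for a server are all hypothetical), each resource declares a desired state, and applying the list only changes what isn't already in that state:

```ruby
# Hypothetical illustration of declarative configuration management.
# A real tool would inspect and change an actual server; here the "server"
# is just a Hash so the idea can be run anywhere.
Resource = Struct.new(:type, :name, :desired)

# Desired state, mirroring the bullets above.
MANIFEST = [
  Resource.new(:package, "apache", :installed),
  Resource.new(:service, "apache", :running),
  Resource.new(:user,    "deploy", :present)
].freeze

# Bring `server` in line with `manifest`. Returns the resources that had to
# change, so a second run against the same server returns an empty list.
def converge(server, manifest)
  manifest.select do |resource|
    key = [resource.type, resource.name]
    next false if server[key] == resource.desired

    server[key] = resource.desired
    true
  end
end

# A server that was partially set up by hand.
server = { [:package, "apache"] => :installed }

first_run  = converge(server, MANIFEST)
second_run = converge(server, MANIFEST)

puts "first run changed:  #{first_run.map { |r| "#{r.type} #{r.name}" }.inspect}"
puts "second run changed: #{second_run.map { |r| "#{r.type} #{r.name}" }.inspect}"
```

The point of the second run is that applying the same description twice is safe, which is what lets you run it on every deploy.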
Automation
Bootstrapping. Anyone that has set up a new server from scratch can tell you... it’s time consuming, labor intensive, and error prone.
Bootstrapping is just part of it though, and it only ever happens once. What’s more interesting is that you can use this to manage your infrastructure as it evolves. Need to start using redis? Just add it to your configuration management, and you’ll have it next deploy.
The best way to illustrate why you should be using configuration management is to explore the consequences of not using it.
Imagine it’s time to add a new application server. Your application is under heavy load, and needs this server to be up and serving requests. How long will it take you to get it up? And how will you know it’s set up correctly? If you’re doing this all manually, you can’t really know the answers to these questions.
Here’s another example. Adding a new dependency to your application. It can be a gem, a native package, a new daemon, whatever. How do you ensure this gets on the server when you need it? Deploy and pray? Log into the server and install it yourself? This sucks, and it’s kind of risky, especially if you’re talking about production.
As always, there are tradeoffs to be made.
Setting up and learning how to do configuration management takes time. Time that could be spent working on user-facing tasks.
The alternative is taking on the risk of having to cold deploy, or having deploys fail because of missing dependencies.
Usually, the balance is to take the risk and have it burn you enough times that it’s more painful to keep going without configuration management than it is to stop and get it set up.
If you already know how to do it, it’s a no brainer. Just DO IT.
configuration management + staging servers = VERY YES
If you use configuration management, and have staging servers, then this is a huge win.
We talked about adding new dependencies earlier. If you are doing configuration management, then staging is the first place you can see if ur doing it right.
There’s basically no downside to using staging servers. The only tradeoff is that servers do cost dollar signs, and staging servers are no different. This leads us to a new thread...
Maths... look around you.
In most cases, you can do some dollar sign math to justify the cost of a thing. Let’s try this.
A staging server may cost $60/mo.
But how can you calculate the cost of not having a staging server? Let’s assume that if you don’t have a staging server, you’re bound to do a bad deploy that it could have prevented. Some code that doesn’t work outright, or is otherwise flawed. Let’s say it causes an hour of downtime while you determine the problem and try to fix it. Do you know how much it costs your business in lost revenue to be down an hour?
This is actually a pretty mature question, and I’d be surprised if many people can answer it off hand. In any event, I think we can do some fuzzy math to say yeah, it probably is more than $60. If that’s the case, then one failed deploy a month is enough to justify a staging server.
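The fuzzy math is easy to write down. A minimal sketch: the $60/month and the hour of downtime come from the example above, while the revenue figure is a made-up placeholder and the helper names are hypothetical. Swap in your own numbers.

```ruby
# Staging cost from the example above; everything else is an assumption.
STAGING_COST_PER_MONTH = 60.0

# Revenue lost to a bad deploy, given hourly revenue and hours of downtime.
def downtime_cost(revenue_per_hour, hours_down)
  revenue_per_hour * hours_down
end

# True when the bad deploys that staging would have caught in a month cost
# more than the staging server itself.
def staging_pays_for_itself?(revenue_per_hour, hours_down_per_bad_deploy, bad_deploys_per_month)
  lost = downtime_cost(revenue_per_hour, hours_down_per_bad_deploy) * bad_deploys_per_month
  lost > STAGING_COST_PER_MONTH
end

# Placeholder: a business earning $100,000/month, spread evenly over ~730 hours.
revenue_per_hour = 100_000 / 730.0

puts format("one hour down costs about $%.2f", downtime_cost(revenue_per_hour, 1))
puts "staging pays for itself: #{staging_pays_for_itself?(revenue_per_hour, 1, 1)}"
```

Spreading revenue evenly over the month is the crudest possible model; an hour down at peak traffic costs more than an hour down at 4am.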
Repeat after me
• development
• staging
• production
capistrano-gitflow
Whenever possible, I like to enforce standards by means of automation.
For the flow of code from development -> staging -> production, we have capistrano-gitflow. Originally done up by apinstein, I did some refactorings and cleaned it up enough to be usable as a gem.
Effectively, this enforces development -> staging -> production. Whenever you deploy to staging, it tags the current branch, including information about the date, the user deploying, and a small blurb about the changes. Assuming this is cool, you can promote a tag to production and go on from there.
If you haven’t deployed to staging yet, you’ll be prompted and it will default to using the last production tag.
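To make the tagging idea concrete, here is a minimal sketch in plain Ruby. This is not capistrano-gitflow's code, and the tag layout is an assumption made up for illustration; the gem's real naming scheme may differ. The helper names, the user, and the date are all hypothetical.

```ruby
# Hypothetical tag layout: staging-<UTC timestamp>-<user>-<blurb>.
# It carries the three pieces of information mentioned above: the date,
# the user deploying, and a small blurb about the changes.
def staging_tag(time:, user:, blurb:)
  slug = blurb.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  "staging-#{time.utc.strftime('%Y-%m-%d-%H-%M-%S')}-#{user}-#{slug}"
end

# The newest staging tag is the candidate for promotion to production.
# Because the zero-padded timestamp comes first, a plain string comparison
# is also a chronological comparison.
def latest_staging_tag(tags)
  tags.grep(/\Astaging-/).max
end

# Arbitrary example values.
tag = staging_tag(time: Time.utc(2011, 8, 27, 9, 30, 0), user: "alice", blurb: "Fix signup form")
puts tag
puts latest_staging_tag(["staging-2011-08-26-10-00-00-bob-tweak-css", tag])
```

Putting the metadata in the tag name means the record of who deployed what, and when, lives in the repository itself.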
A play on release early, release often. Although technically, I guess it’s the same.
It’s basically the same thing we hear in the open source community.
The sooner you release code, the sooner you can validate it and the sooner you can get feedback. Does it work? Does it not break the entire site? Are users happy?
By deploying early and often, we’re also limiting risk. The fewer changes that go out in a single deploy, the fewer things there are that can possibly break. By waiting to deploy, you’re accumulating a larger set of changes to deploy, and therefore there’s more surface area to debug if it breaks.
In a way, you can consider undeployed code a liability.
Imagine spending a day or two doing some code cleanups to get ready for a sprint. Should you deploy when you are done and happy with the refactorings, or should you go ahead and do your sprint?
If it were me, I’d deploy the refactorings first. That way, the code is out there, and you’ll know if it performs equally to its non-refactored version. It’s really easy to introduce performance-killing changes in even a few-line diff.
If you instead wait and deploy with new features, and anything goes awry, you have significantly more code to spelunk through to track down a potential problem.
What does it even mean?
This drives me nuts. By saying something ‘feels’ slow, there’s an implied assumption. The assumption is that it should be fast. Saying it like that is... weird, because it gives no indication of what is slow or not.
The trick is in determining what the assumption is, and then finding a way to measure and identify the problem.
How can we do this?
Metrics everywhere!
With the right tools, you can easily be continuously collecting data so you have it in your pocket when you need it.
• New Relic - http://newrelic.com
• Scout - http://scoutapp.com
These are the two we use and highly recommend.
New Relic is really great for giving a high level view of your application. We’re talking at the request/response level, including all sorts of fun maths with most time consuming requests, highest standard deviation, etc. It also breaks down requests by where time is spent: whether it’s all in the view, the controller, the database, partials, and so on.
Scout is useful for other reasons. While New Relic is good for high level understanding of your application, Scout is a bit more low level. You can use it to collect metrics about your servers, and how well they are running. Memory, CPU, disk space, IO, mysql connection stats, and so on.
I really believe these are a great combination, because New Relic can point you in the direction of a problem area, and Scout can help you better understand what’s contributing to it at a system level.
The front page feels slow
The front page is taking 10 seconds to load, but we really need it to be loading in under 1 second.
The primary key seems like it’s increasing rapidly
The primary key is at 90% of its maximum, up from 80% yesterday, and looks like it’ll run out overnight.
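The precise version of that statement has enough in it to estimate a deadline. A minimal sketch, assuming growth carries on at the same linear rate (an assumption; real growth may speed up), with a hypothetical helper name:

```ruby
# Estimate days until a counter hits its maximum, assuming linear growth.
# Inputs are percentages of the maximum, measured one day apart.
def days_until_exhausted(percent_yesterday, percent_today)
  growth_per_day = percent_today - percent_yesterday
  return Float::INFINITY unless growth_per_day.positive?

  (100.0 - percent_today) / growth_per_day
end

# 80% yesterday, 90% today: 10 points left at 10 points per day.
puts days_until_exhausted(80, 90)
```

With the numbers above that works out to about a day, which matches "it’ll run out overnight". The vague version gives you nothing to plug in.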
IO seems high
IO fluctuates up to 90% sometimes, but doesn’t appear to have a negative effect.
Further Reading
• Web Operations - John Allspaw and Jesse Robbins
  http://www.amazon.com/Web-Operations-Keeping-Data-Time/dp/1449377440/ref=sr_1_1?s=books&ie=UTF8&qid=1314447411&sr=1-1
• Continuous Delivery - Jez Humble and David Farley
  http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912/ref=sr_1_4?s=books&ie=UTF8&qid=1314447411&sr=1-4
• “Web Operations for Developers 101”
  http://www.paperplanes.de/2011/7/25/web_operations_101_for_developers.html