Hi! I’m Jim Van Fleet. I’m the founder and President of it’s bspoke, a Ruby consultancy. Our clients include Relevance, Engine Yard, and Efficiency 2.0. I’m here to talk about the Performance and Capacity of Rails Applications.
Central and Western Carolina are growing enormous data centers. Maiden, NC is 45 miles from Charlotte, NC where this presentation was given for the first time. Apple is spending $1b on this data center, whose stated goal is to back iTunes and the App Store.
Google is already using the $600 million data center they’ve built in Lenoir, NC, about 75 miles from Charlotte, near Hickory.
This week, Facebook announced a $450M data center for Rutherford County, about halfway to Asheville up Highway 74.
None of them are running any Ruby as a part of their run-times, at least as far as we know. Apple’s stores are written in WebObjects, their own Java framework. Google’s infrastructure is known not to include Ruby. Facebook is a famous PHP application, although they now compile that code to C++ for their runtime.
Twitter, the largest website with Ruby as a known part of its runtime, is getting something “custom-built” in Utah. Their entire funding raise is around $150m, so it’s safe to say that’s not all going to the data center. Groupon has no public plans to create a data center at this time.
Why does it matter? No big websites use Ruby, big deal.
The big deal is that, if you ever want to advance to the highest levels of this profession, you have to understand what is becoming common knowledge among those who maintain large presences on the web. One piece of that knowledge is that your database is going to get very unwieldy over a long enough timespan.
A layperson can probably guess at some reasons that Google wouldn’t spend $600 million on a single computer.
Where it gets trickier is in the details of how exactly to have more than one computer be responsible for being the authoritative source of long-lived data for an application.
The naive replica approach doesn’t ring true either.
Google isn’t actually hiring the best engineers it can attract to make sure that these two machines can stay running. We have some theory and some of their white papers that explain what they’re up to.
I want to tell you about a couple of approaches to splitting your datastore onto multiple hosts.
BigTable and Dynamo describe data storage systems in use at Google and Amazon respectively. The ideas in their white papers have inspired open source implementations like Cassandra and Riak to varying extents. They encourage the use of alternative data schemas and query forms to handle large data. The sharding approach is more easily understood: it’s just keeping bits of the data in different databases and having clients figure out which ones to use and trust.
You should at least be prepared to grapple with these alternatives, although you almost certainly do not need to do so before you are quite popular. Be prepared.
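To make the sharding idea concrete, here is a minimal sketch of client-side shard selection in plain Ruby. The shard names and hosts are hypothetical placeholders, and hashing the key with CRC32 is just one simple routing choice, not something the white papers or any particular project prescribe.

```ruby
require "zlib"

# Hypothetical shard map: these names and hosts are placeholders.
SHARDS = [
  { name: "users_0", host: "db0.example.internal" },
  { name: "users_1", host: "db1.example.internal" },
  { name: "users_2", host: "db2.example.internal" }
].freeze

# Pick a shard from a stable hash of the key, so that every client
# maps the same key to the same database.
def shard_for(key, shards = SHARDS)
  shards[Zlib.crc32(key.to_s) % shards.size]
end

# The same key always routes to the same host.
puts shard_for("user:42")[:host]
```

Note the trade-off in this sketch: with plain modulo routing, changing the number of shards remaps most keys, which is part of why the real systems are more involved.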
That the underpinnings of the entire web application world are in question is a big deal.
At SurgeCon 2010 in Baltimore, Christopher Brown characterized those of us outside of Fortune 100 companies as being a punchline.
I believe this characterization has basis in fact.
In my various engagements with clients and teams over the years, the primary signifier of our immaturity is our inability or unwillingness to plan for failure. Do better.
John Allspaw, author of The Art of Capacity Planning, argues on his blog that MTTR (mean time to recovery) matters more than MTBF (mean time between failures) in most cases,
with some caveats about some kinds of errors that are never acceptable.
Like the kind of errors GitHub experienced on Sunday when they dropped their production database. GitHub was still up and responding with 200s as this happened, by the way, which gave them a way to communicate directly with their clients about expectations for the restoration of service.
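One way to see why recovery time can matter more than failure frequency is the standard steady-state availability formula, MTBF / (MTBF + MTTR). This is my own illustration rather than Allspaw's argument, and the numbers below are made up.

```ruby
# Steady-state availability: the fraction of time the service is up.
# Both arguments are in the same unit (hours here).
def availability(mtbf_hours, mttr_hours)
  mtbf_hours / (mtbf_hours + mttr_hours).to_f
end

# Made-up numbers, for illustration only.
baseline      = availability(500.0, 4.0)   # a failure every ~3 weeks, 4 hours to recover
fewer_faults  = availability(1000.0, 4.0)  # failures half as often, same recovery time
faster_repair = availability(500.0, 0.5)   # same failure rate, 30 minutes to recover

puts format("baseline:      %.4f", baseline)
puts format("fewer faults:  %.4f", fewer_faults)
puts format("faster repair: %.4f", faster_repair)
```

With these particular numbers, cutting recovery to 30 minutes yields higher availability than doubling the time between failures.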
Release It! remains one of the finest books about programming that I’ve ever read, and I recommend it to anyone who is interested in shipping code. It deals with these sorts of questions head on.
Here are some recommendations that you can use in your own projects, inspired by that book. If you can’t access your “master” data store, you are likely not to be able to recover from that error. So catch it, and handle it in your ApplicationController.
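In a Rails app, rescue_from in ApplicationController is the natural place for this. So that it runs without Rails, the sketch below shows the same idea one level down, as a Rack-style middleware in plain Ruby. DataStoreUnavailable is a hypothetical stand-in for whatever exception your database adapter actually raises, and answering with a 503 and a Retry-After header is my choice, not something the book or this talk mandates.

```ruby
# Hypothetical error type: a stand-in for whatever your database
# adapter raises when the primary data store is unreachable.
class DataStoreUnavailable < StandardError; end

# Rack-style middleware: any object with call(env) that returns
# [status, headers, body].
class DataStoreGuard
  def initialize(app)
    @app = app
  end

  def call(env)
    @app.call(env)
  rescue DataStoreUnavailable
    # Fail politely instead of letting the exception bubble up as a 500.
    [503,
     { "Content-Type" => "text/plain", "Retry-After" => "60" },
     ["We are having trouble reaching our database. Please try again shortly.\n"]]
  end
end

# Two toy apps: one that works, one whose data store is down.
healthy = ->(env) { [200, { "Content-Type" => "text/plain" }, ["ok\n"]] }
broken  = ->(env) { raise DataStoreUnavailable, "primary database is down" }

puts DataStoreGuard.new(healthy).call({}).first
puts DataStoreGuard.new(broken).call({}).first
```

The point is that the failure is caught in exactly one place, so every request gets a deliberate response rather than a stack trace.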
The book also covers building a test harness to exercise your client code and a load test for your server side. Don’t be afraid to experiment, using these tests as a guideline for what you should expect.
So far, we’ve talked about capacity and failure tolerance. As we shift into the other element of this talk, I should point out that performance and capacity are related, but they are not the same. You’ll find a great description of this concept in Scalable Internet Architectures from Theo Schlossnagle.
Many Rails developers will experience their capacity as heavily influenced by the performance characteristics of their application on their application servers. This does not mean that high performance necessarily leads to high capacity.
Given high-quality programming, aware of its resource limitations, the number of VMs you have running Rails app servers and the number of workers per server will determine your capacity. As programmers, you can screw this up and make your code the limiting factor, but hopefully you’ll fix that. We’ll get more into that idea later.
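As a back-of-the-envelope sketch of that relationship, assume each worker handles one request at a time and that the app servers are the only bottleneck. The method name and the numbers are mine, for illustration only.

```ruby
# Rough upper bound on requests per second, assuming each worker
# serves one request at a time and nothing else is the bottleneck.
def max_requests_per_second(vms:, workers_per_vm:, avg_response_seconds:)
  (vms * workers_per_vm) / avg_response_seconds.to_f
end

# Illustrative numbers: 4 VMs, 6 workers each, 250 ms average response.
puts max_requests_per_second(vms: 4, workers_per_vm: 6, avg_response_seconds: 0.25)
```

Under these assumptions, halving response time doubles the ceiling, but so does doubling the VM count, which is one way to see that performance and capacity are related without being the same thing.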
There are several stacks available to choose from in the marketplace. The moniker “Platform as a Service” or PaaS has sprung up around them. They each have different abstractions and limitations, but each offers a solution that can ensure you’re focusing on development instead of system maintenance and operations.
They are all running on “clouds,” which is in vogue now. Everybody wants everything to be on the cloud. It’s in commercials. I implore you to know what it means and, perhaps more importantly, what it does not.
Being on the cloud does not by itself mean that you can increase capacity by spinning up a new node. It can, but only with adequate and thoughtful preparation.
In a post discussing scalability at the data storage level, James Golick points out that software cannot use resources that are not available.
In this same fashion, your system needs your mental resources when it’s operating at capacity, or after a failure. After you boot your new node, what do you need to do to get your services running? Does it need to run memcached or sphinx? Does it need to know where redis is to enqueue jobs to run? How long is it going to take you to deploy to it and incorporate it? For most projects I’ve seen, provisioning time for a new node is an inconsequential factor in MTTR.
Now that I’ve satisfied myself that you’ve been properly acquainted with the big picture, I’ll get into the pieces that might be useful to you on a day to day basis.
If you walk away with one piece of advice, don’t design by laptop. Your laptop is probably better than the machines you deploy to, and unless you keep every browser you’ve got open to some Flash page, it will perform much better than the machines you’ll ultimately run on.
The performance question, although infinite in its variations, almost always comes down to these four bottlenecks: CPU, amount of RAM, Network IO and Disk IO.
If you are CPU bound, you win!
You’re in great shape to expand to new nodes when the time comes until something else is your bottleneck. There are cases I know I’m missing here, but this is a good general guideline.
Using too much RAM is one of the most common limitations on capacity, and it becomes a performance problem as soon as a process begins to swap. The most common causes are memory-hungry uses of ActiveRecord and DOM parsing of XML.
One of the most effective means of getting higher capacity out of a single VM is to evaluate Ruby Enterprise Edition along with Passenger. REE allows each Rails worker to use a much smaller amount of memory, although you should test and be sure your application can run on it. For even better results, rack up a Sinatra app for your high-concurrency endpoints.
Network IO bottlenecks are probably neck and neck with RAM problems as a cause of performance problems. Hopefully you’ll use your failure planning as an opportunity to ensure your application is not affected overly by the first two, marking any non-critical service as “TRY AGAIN LATER” and proceeding on its merry way, although maybe a bit slower.
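A minimal sketch of the “TRY AGAIN LATER” idea in plain Ruby follows. The non_critical helper and the TRY_AGAIN_LATER marker are hypothetical names of mine, and the helper uses the standard library's Timeout module to keep the example short. In real code, library-level timeouts such as Net::HTTP's open_timeout and read_timeout are usually a safer first choice.

```ruby
require "timeout"

# Hypothetical marker returned when a non-critical service is unavailable.
TRY_AGAIN_LATER = :try_again_later

# Run a non-critical call with a time budget. On a timeout or any other
# ordinary error, return the fallback rather than failing the request.
def non_critical(budget_seconds, fallback = TRY_AGAIN_LATER)
  Timeout.timeout(budget_seconds) { yield }
rescue Timeout::Error, StandardError
  fallback
end

# A slow service gets cut off; a fast one passes straight through.
slow = non_critical(0.1) { sleep 1; ["slow result"] }
fast = non_critical(0.1) { ["fast result"] }

puts slow.inspect
puts fast.inspect
```

The page can then render a placeholder wherever it sees TRY_AGAIN_LATER and carry on with the critical parts of the response.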
Most Rails apps end up with their response time dominated by time spent in the database, although every workload is different.
High Performance MySQL has my highest recommendation, if you are using MySQL.
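BTW, did you know that top and ps aux will show you what each PostgreSQL backend is doing in real time? PostgreSQL puts the current activity in each backend's process title, which makes long-running queries easy to spot. True story.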
Just read the slide on this one.
This is a grab bag of gotchas and relevant to-dos.
Disk IO is one of the nastiest. You basically have to dig around it, and usually lean on your host. Temporary or permanent disk IO underperformance is unfortunately still somewhat common on many hosts.
Shared filesystems are particularly troublesome. They turn what looks like disk IO into a large amount of network IO, and can quickly complicate things if the storage layer underperforms. Avoid them in your architecture if at all possible.