19. gap analysis
• load balancing?
• how do we manage communication between instances? what about talking back to the datacenter?
• how do we scale up and back?
• how do we secure the instances?
20. nginx
• elastic ip points to nginx, which handles all of our traffic
• nginx has the rules which determine where to send requests
30. Nimbul
Meta Data Store
Configuration Management
Access Control
Publishers
Sane Auto-Scaling UI
F2WW
31. Nimbul Cloud
Providers
( EC2 )
Provider Accounts
( Dev, Staging, Production )
Clusters (“Slices”)
( UGC Staging, WWW Production )
Server Profiles
( UGC FrontEnd, UGC MySQL Master )
Instances
32. Nimbul Users
Nimbul Admins
( Full Access, can’t read keys )
Before Nimbul
Provider Account Admins
( Control Users, Resources, Env Vars, Startup Scripts, etc )
Cluster (“Slice”) Admins
( Control Users, Resources, Env Vars, Startup Scripts, etc )
SSH Users
( Can be granted SSH access to any running instance )
After Nimbul
comments on articles and blogs - we get about 130K comments per month and 1.5 million reader recommendations.
rate and review for movies, theater, dining and travel destinations
going back about two years now - comments on articles had been live for a year. We (the UGC platform team at the Times) were in the process of standardizing the entire platform and adding features like reporter replies and the community open API. We had ramped up our internal community hardware for the presidential election, adding a few servers to handle the extra traffic we were expecting. One Friday around 6pm I got a call from systems saying we were having trouble with our API servers; the load was off the charts.
I immediately dug in and went into the controlled panic that settles in when you get a call like this from systems. Soon enough the alerts started rolling in for the front-end machines as well. With some log checks we quickly realized our friends at Yahoo were linking to a story that had comments turned on. We were seeing around 600 requests per second, which was too much for our architecture at the time to handle. Unfortunately we had no choice but to turn comments off on the story, as it was affecting the rest of the platform.
This brought a couple of things to light. One, we needed to rethink the architecture a bit and figure out a way to scale dynamically. Two, quickly scaling hardware for us at the time meant scrambling to get a request in, then actually acquiring the machines and getting them set up; we were looking at a month (if it was quick).
So, what did we do? We had two options. One was another round of capacity planning, getting a few more machines to be able to handle the spikes. Boring.
The other, much sexier option was moving out to the cloud. At the time some of our colleagues had been playing with applications on Amazon's EC2 infrastructure with much success. Thinking about it, this could be the answer to all of our worldly problems. It was also an intimidating proposition, as no one had moved an entire platform out there yet, but the upside was a never-ending supply of Amazon instances to scale up and down as we pleased.
The key here, we thought, was not only scaling up for spikes but perhaps also scaling down at night when not as many of you were commenting.
Back in 2007/2008, this was our setup: 6 front-end zones, 2 API zones, 6 back-end zones, plus one master DB and 3 slaves. memcached was running on the back-end zones. You can tell how long ago it was from this ancient-looking diagram.
So as we closed in on the architecture, we came up with a similar setup in the cloud, with front-end, API, memcached and MySQL instances filling out the platform. We didn't change much in the way the platform looked except to split out the caching, but we definitely had some gaps to fill.
We had lots of questions that were fun to answer. How would the front ends know which API instance to request? Where exactly is the database that API instance is supposed to query? Better yet, how are we going to manage all of these instances? How exactly will it scale? How will we call internal APIs that live back in our data center?
For load balancing we set up an instance running only nginx and assigned an Elastic IP to it. We did the same for proxying requests back to internal APIs in the data center.
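the routing rules themselves are plain nginx configuration. A minimal sketch of the idea follows; the upstream name, addresses and paths are made up for illustration, not our actual config:

    # hypothetical upstream pool; in practice the server entries are kept in
    # sync with the shared host file described below
    upstream cmty_api {
        server 10.0.1.10:8080;
        server 10.0.1.11:8080;
    }

    server {
        listen 80;

        # send incoming requests to whichever API instances are currently in the pool
        location / {
            proxy_pass http://cmty_api;
            proxy_set_header Host $host;
        }
    }

    # the proxy for internal APIs is a separate nginx instance with a similar
    # server block, pointing back at hosts in the data center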
when we have to scale up or back, a shared host file is automatically updated to add or remove the instances. That host file is then pushed to each instance, and monit watches it and bounces the load balancer when it changes.
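the monit side is just a file check. A rough sketch, with a placeholder file path and reload command rather than our actual setup:

    # watch the pushed host file and bounce nginx whenever its contents change
    check file cloud_hosts with path /etc/cloud-hosts
        if changed checksum then exec "/etc/init.d/nginx reload"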
For security we simplified the use of Amazon security groups to make it easy to assign groups to specific server types. For instance, if I am a community front-end instance in production, I grab the production security group as well as the general community group and then the specific cmty-fe group.
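in other words, the groups stack from general to specific at launch time. A hypothetical sketch using boto3 (a present-day library rather than what we used back then; the AMI and instance type are placeholders, the group names are the ones above):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # launch a production community front-end instance with its stacked
    # security groups (environment -> platform -> role)
    ec2.run_instances(
        ImageId="ami-12345678",   # placeholder AMI
        InstanceType="m1.large",  # placeholder instance type
        MinCount=1,
        MaxCount=1,
        SecurityGroups=[          # EC2-Classic style group names
            "production",         # environment-wide rules
            "community",          # rules shared by the whole community platform
            "cmty-fe",            # rules specific to front-end instances
        ],
    )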
we went with a couple of different options for monitoring and alerts in the cloud: Nagios for monitoring and alerting, and Munin for the pretty pictures.
one of my personal favorite nice-to-haves to come out of this project was individual development instances. We created a condensed version of the entire platform on a small EC2 instance with a recent snapshot of our staging database and all of our code. With our cloudsource deployment system, which Vadim will cover in a few minutes, we can grab any version of our code to deploy on these instances.