Rafter Case Study

This is a case study of how Rafter maintains two data centers. It is a chapter of the book in progress DevOps: A Software Architect's Perspective.


DevOps: A Software Architect's Perspective
Chapter: Case Study - Rafter, Inc.
By Len Bass, Ingo Weber, and Liming Zhu, with Chris Williams

For review purposes - 17 April 2014 - Copyright 2014 Len Bass, Ingo Weber, Liming Zhu
This is a chapter of a book we plan on releasing one chapter at a time in order to get feedback. This chapter is the fifth one we have released. Information about this book and pointers to the other chapters can be found at http://ssrg.nicta.com.au/projects/devops_book. We would like feedback about what we got right, what we got wrong, and what we left out.

We have changed the book organization from what was previously announced. Part I of the new organization fills in important background, Part II describes the deployment pipeline, Part III discusses cross-cutting issues, and Part IV is a collection of case studies that exemplify the issues discussed throughout the book. The new table of contents is:

Part I - Background
1. What is DevOps? (initial release 2013.12.10)
2. The cloud as a platform
3. Operations

Part II - Deployment Pipeline
4. Overall architecture (initial release 2014.01.23)
5. Build/Test (initial release 2014.03.13)
6. Deployment (initial release 2014.04.01)

Part III - Cross-Cutting Concerns
7. Monitoring
8. Security
9. Other ilities
10. Operations as a process
11. Business considerations - topics include compliance, licensing, and the maturity model

Part IV - Case Studies

This chapter is a case study about business continuity. Several others are under negotiation.
Case Study: Rafter, Inc.
with Chris Williams

Introduction

For many years, students have been frustrated by the high cost of new textbooks and their low value as used textbooks. Rafter (originally BookRenter.com) saw this as a business opportunity and in 2008 created a business renting textbooks to students. The premise is simple: a student determines the textbooks they will need for the new semester and orders them from the BookRenter web site. Rafter ships the chosen books to the student. At the end of the semester, the student returns the books, and they become available to other students for the next semester.

As you may have deduced, this is a very seasonal business. If the BookRenter.com web site is down at the beginning of the semester, customers are lost. Business continuity during the high-activity portions of the year is sufficiently important to Rafter that they have created a duplicate datacenter. With two datacenters, Rafter not only can move service from one datacenter to another in the case of an outage of the primary datacenter but also has a testing site that replicates the production environment.

Keeping two datacenters synchronized is not only a problem of keeping the data synchronized but also a problem of keeping the environment replicated and making sure that the applications are architected appropriately. In this chapter, we explore how Rafter accomplishes these different forms of synchronization.

Two fundamental use cases exist for failover - controlled and uncontrolled. A controlled failover means that the primary datacenter is still available and there is time for a variety of preparatory measures before switching to the secondary datacenter. This is used to test the failover measures as well as to allow for maintenance of the primary datacenter. An uncontrolled failover occurs as a result of a disaster, whether natural or manmade. We return to these use cases after we describe the solutions that Rafter has put in place.

Current State

Currently Rafter runs two datacenters with exactly the same hardware. They run their own datacenters because, at the time this decision was made, they could not get the necessary input/output performance from a public cloud. This decision is currently being re-evaluated, although many organizations are required to operate their own datacenters for regulatory reasons. The two datacenters are on opposite sides of the North American continent, but users do not see a difference in response time. They support about 300 virtual servers using VMware. A typical virtual server has 16GB of RAM and 4 virtual CPU cores. The frontend tier averages about 30,000 to 50,000 requests per second. The workload is approximately evenly split between reads and writes, with about 80% of the requests coming from APIs and the remainder from web browsers. Approximately another 150 servers are used for analytical and staging/testing purposes. These run in a combination
of onsite private cloud (Eucalyptus) and public cloud (Amazon Web Services). Because the SLAs on these servers are more relaxed than those on the servers in Rafter's datacenters, failover is not an issue for them.

Rafter has about 35 Ruby on Rails applications, about 50 backend applications in Ruby, and a couple of applications in other languages (Clojure, R). Rafter's architecture is a standard three-tier architecture - web tier, business logic tier, and database tier. We will describe the business logic tier and then the supporting database tier.

Business Logic Tier

We discuss two aspects associated with the business logic. First is the logic of the applications, and second is the infrastructure that Rafter uses to maintain the applications.

Application Logic

If an application needs to store state that persists across multiple requests, then it must use data stores that either can fail over (e.g., Rafter's SQL database) or are externally available from both datacenters (e.g., AWS S3). This restriction applies even within a single datacenter in a load-balanced environment. As you will recall from the chapter "Overall Architecture", storing application state on a local server is not a good practice, since a subsequent request that needs this data might get sent to another server, which does not contain the data. Additionally, no application configuration changes are necessary during a failover, as all external resources point to DNS hostnames that do not change.

Keeping Applications in Synch

Every time Rafter deploys a new version of an application to production, it is deployed to both datacenters at the same time. This ensures that both datacenters are running the same version of the applications. The deployment system uses an infrastructure library to figure out which servers an application is deployed to, so there is no need to maintain separate lists of servers in the deployment system. One issue is performing maintenance in the secondary datacenter: an application server might be offline and unavailable for deployment. This problem is resolved by storing the latest version of every application in the Chef server. Once the operations team finishes their maintenance and powers the server back on, Chef will detect that the application on the server is out of date and deploy the latest version. This ensures that application code stays in sync across servers even when servers are not available for every code deployment.

Infrastructure

The infrastructure support exists as a library. The library is packaged as a gem in RubyGems. RubyGems is a package manager for the Ruby language that provides a standard format for distributing Ruby programs and libraries. The library is used inside many of the infrastructure-related applications and provides a framework for both adding new applications and discovering information about the infrastructure.
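For readers unfamiliar with gem packaging, the sketch below shows a generic gemspec of the kind such a library would typically ship with. The gem name, version, dependency, and file layout are illustrative assumptions, not Rafter's actual library.

    # infrastructure.gemspec - a minimal, hypothetical packaging sketch
    Gem::Specification.new do |spec|
      spec.name          = 'rafter-infrastructure'   # assumed name
      spec.version       = '0.1.0'
      spec.summary       = 'Application blueprints and infrastructure discovery'
      spec.authors       = ['Operations Team']
      spec.files         = Dir['lib/**/*.rb']
      spec.require_paths = ['lib']
      spec.add_dependency 'chef'                     # the library runs inside Chef
    end

Built with `gem build infrastructure.gemspec`, the resulting .gem file can then be published to an internal gem server such as the Geminabox cluster described later in this chapter.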
Adding an Application

Every application in the Rafter platform has a small JSON (JavaScript Object Notation) file which serves as a blueprint containing instructions on how to properly install the application on the infrastructure. Examples of attributes in these JSON files are:

  • name
  • type
  • git repository
  • hostname
  • cronjobs
  • daemons
  • log rotations
  • firewall rules
  • database grants
  • ssl certificates
  • load balancer VIPs

The infrastructure library, which can run inside of Chef, reads these JSON files and determines how to set up the application. The library contains many defaults for the platform (which can be overridden if desired), so the JSON files tend not to be very large. For example, consider a simple Ruby on Rails application named "cat" that lives in a repository on GitHub. The JSON for this app might look like:

    {
      "id"       : "cat",
      "repo_url" : "git@github.com:org/cat.git",
      "type"     : "rails"
    }

On the Chef server, the "cat" application is assigned to a server (or set of servers) either via a role or a node attribute. Chef then runs on this server and uses the library to query the latest JSON for the cat application and apply the appropriate setup steps, such as:

1) check out the cat application from GitHub
2) deploy the cat application to the local server
3) set up nginx and unicorn (a Ruby on Rails app server) with the proper virtual host for the application at cat.rafter.com and install the proper SSL certificate
4) set up default log rotations for the app

This is just a simple example, but the library can handle more complex application setups: installing cronjobs and daemons, creating database accounts, managing database grants, controlling which developers have access to the application on the server, handling development and staging environments, and so on.
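To make the blueprint mechanism concrete, here is a minimal sketch of how platform defaults might be merged with an application's JSON blueprint. The default values and helper name are illustrative assumptions, not Rafter's actual library code.

    # Hypothetical sketch: merging platform defaults with an application blueprint.
    require 'json'

    PLATFORM_DEFAULTS = {
      'type'           => 'rails',
      'log_rotations'  => { 'keep' => 14, 'compress' => true },   # assumed default
      'firewall_rules' => []
    }

    def load_blueprint(path)
      blueprint = JSON.parse(File.read(path))
      PLATFORM_DEFAULTS.merge(blueprint)   # values in the blueprint win
    end

    app = load_blueprint('blueprints/cat.json')
    app['id']    # => "cat"
    app['type']  # => "rails", from the blueprint or the platform default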
It can also create separate tiers for an application on different sets of servers. For example, an application can be set up to serve web traffic only on one set of frontend servers, and to run cronjobs and backend daemons only on another set of servers, so the two instances of the application do not compete for resources.

Discovering the Infrastructure

Chef stores information about the entire infrastructure in the Chef server. The Chef server exposes a set of APIs for both querying this data and storing additional data about the infrastructure (via JSON-based documents called databags). The Rafter library utilizes the Chef server as a database for querying information about the infrastructure. Here are some examples of the use of the infrastructure library.

  • Get a list of all applications on the entire infrastructure by calling:

        DevOps::Application.all

  • Get detailed information about a particular application (everything from where its GitHub repository lives to the database it uses):

        myapp = DevOps::Application.load("myapp")
        myapp.repo_url
        => git@github.com:org/repo.git
        myapp.application_database
        => myapp_production

  • Find which servers an application is deployed to:

        node = myapp.nodes.first
        node.name
        => web01

  • Get detailed information about the server, such as the datacenter hosting the server:

        node.datacenter.name
        => dc1

  • Get the state of the datacenter:

        node.datacenter.active?
        => true

Different applications use this library for accomplishing different tasks:

  • When Chef is running on servers to apply configuration, the library is used extensively. Everything from which cronjobs to set up for an application, to which daemons to set up, to which firewall rules to add is determined by the library. For example, here is part of a Chef cookbook which adds a firewall rule to the server based on the state of the datacenter the server is in:

        if node.datacenter.active?
          iptables_rule "block_api" do
            enable false
          end
        else
          iptables_rule "block_api"
        end

  • The deployment system uses this library to discover which servers it should deploy a particular application to. It may also need to make special considerations if a server resides in a particular datacenter. For example, two such deployment constraints are: (a) do not run database migrations from the inactive datacenter on deploy, since this is slower, and (b) skip deploying to any server in an offline datacenter.

  • The staging and test systems use the infrastructure library to get information about all applications that can be deployed to staging servers and to determine inter-application dependencies.

Keeping the Infrastructure in Synch

The secondary datacenter is set up exactly as the primary datacenter. It contains the same number of servers as the primary datacenter, and all application servers are configured identically. So any time a new server is created in the primary, it is also created in the secondary at the same time. Chef keeps all of the servers configured the same way across both datacenters. There are, of course, some minor configuration differences between the two datacenters due to naming (different IP space and domain), but otherwise they are identical. Whenever a configuration change is pushed to Chef, it gets applied to servers in both datacenters.
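As an illustration of how a single set of cookbooks can tolerate those naming differences, the sketch below parameterizes the per-datacenter values as Chef node attributes. The attribute names and values are assumptions for illustration, not Rafter's actual cookbooks.

    # attributes/default.rb - hypothetical per-datacenter attributes
    default['rafter']['datacenters'] = {
      'dc1' => { 'domain' => 'dc1.example.internal', 'ip_space' => '10.1.0.0/16' },
      'dc2' => { 'domain' => 'dc2.example.internal', 'ip_space' => '10.2.0.0/16' }
    }

    # recipes/resolver.rb - the same recipe runs unchanged in either datacenter
    dc        = node['rafter']['datacenter']                 # e.g. "dc1" or "dc2"
    dc_config = node['rafter']['datacenters'][dc]

    template '/etc/resolv.conf' do
      source 'resolv.conf.erb'
      variables(search_domain: dc_config['domain'])
    end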
Database Tier

Three different databases are used in the database tier, each with a specialized purpose.

Transactional Data

Figure 1: The Clustrix database management systems

The majority of the data for Rafter is stored in a transactional, ACID-compliant, SQL-backed database called Clustrix. It is a clustered database appliance and a drop-in replacement for MySQL. There are 3 Clustrix nodes in each datacenter, although they behave as a single database. About half a terabyte of data is stored in this database, and there are usually several gigabytes of changes a day. The Clustrix databases in the two datacenters are configured in a Master-Master replication scheme. However, the database in the secondary datacenter has a READONLY flag set on it so that no application can accidentally write data to it. See Figure 1.

Rafter's goal is to keep the replication delay between the two datacenters as small as possible. Most of the day it is under 1 second, but it is always necessary to monitor, since application changes can introduce new queries or data constraints that increase the delay. This form of replication has the same sort of pain points as standard MySQL replication. The replication is single threaded and asynchronous, so both the complexity and the rate of incoming write queries can cause the failover database to fall behind the active database. Luckily, there are lots of tools for dealing with both of these issues, and the database vendors are coming up with new solutions every year (e.g., hybrid replication, multithreaded slaves, slave prefetching).

Some ways Rafter has tackled complex queries are rewriting the query so it is faster and breaking it into smaller chunks (e.g., LIMIT 10000). Another way is to switch to row-based replication (RBR) so the slave does not have to run complex queries and can simply update any changed rows. Tackling sheer write volume can be more difficult. One particular application was responsible for half of the total write volume. This data was not needed by any real-time application, so instead of writing the data to Clustrix, Rafter changed the application to write out to flat files; a specialized application then picks up the files on an hourly basis and imports them into the data warehouse, which does not have the same SLAs as the production infrastructure. Although not currently used, Clustrix also allows setting up multiple replication streams for different databases, so that is another tool available for dealing with high write volume.
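A minimal sketch of the flat-file approach described above might look like the following. The directory, file naming, and event fields are assumptions for illustration.

    # Hypothetical sketch: divert high-volume, non-realtime events to hourly
    # flat files instead of the transactional database.
    require 'json'
    require 'time'

    LOG_DIR = '/var/spool/analytics'   # assumed spool directory

    def record_event(event)
      hour_bucket = Time.now.utc.strftime('%Y%m%d%H')
      path = File.join(LOG_DIR, "events-#{hour_bucket}.jsonl")
      File.open(path, 'a') { |f| f.puts(event.to_json) }   # append one JSON line
    end

    record_event(user_id: 42, action: 'page_view', at: Time.now.utc.iso8601)

An hourly importer can then load each completed file into the data warehouse in bulk, keeping this write volume off the replication stream entirely.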
Infrastructure Support

Figure 2: Redis

Rafter uses Redis in their infrastructure for job queuing systems (Resque and Sidekiq), for caching, and as a fast key-value store. In each datacenter there are two separate Redis clusters, with each cluster set up as shown in Figure 2. One Redis node serves as the "master", to which all applications send their reads and writes. It can then fail over to any of the slaves in the event of a failure.

As Redis usage increased, the Redis slaves could not keep up with the master due to the rate of writes being sent. Rafter started to notice the delay because the master server's memory was slowly growing even though the data being stored inside Redis was not. Rafter diagnosed this and found that Redis was buffering all pending commands for the slaves in memory; because the slaves could not keep up, this buffer kept growing. Others have diagnosed this problem as well (see http://tech.3scale.net/2012/07/25/fun-with-redis-replication/). The underlying problem was that Rafter was generating more Redis replication traffic across the datacenters than the available bandwidth would allow. The solution was to run Redis replication inside an SSH (Secure Shell) tunnel with compression enabled. The compression in the tunnel substantially reduces the bandwidth transmitted across the WAN (Wide Area Network), at the cost of about a 20% CPU increase. This change enabled the slaves to keep up with the master.

Session Data

Figure 3: Couchbase (Memcached)

Rafter uses Couchbase's implementation of Memcached for storing session data and for caching throughout their platform. This is an example of one database in the platform where the data is not synchronized between the datacenters. See Figure 3 for a representation of Couchbase in the datacenters. The two Couchbase clusters are completely separate, so when the secondary datacenter is activated, the cache will be stale. This does not cause a problem, since the data is either temporary (session data) or cached data that the applications automatically repopulate from other data sources on a cache miss.
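The repopulate-on-miss pattern is straightforward with any memcached-compatible client. The sketch below uses the Dalli Ruby client as an illustration; the hostname, key scheme, and Product model are assumptions, and the chapter does not say which client Rafter uses.

    # Hypothetical sketch: read-through caching against the memcached interface,
    # repopulating on a miss (e.g. right after failing over to the cold cache).
    require 'dalli'

    CACHE = Dalli::Client.new('memcached.example.internal:11211')   # assumed host

    def product_details(product_id)
      CACHE.fetch("product:#{product_id}", 300) do
        # Cache miss: fall back to the authoritative store and repopulate.
        Product.find(product_id).to_h
      end
    end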
Other Infrastructure Tools

Several other tools are used in the Rafter infrastructure. These include gem repository servers and ElasticSearch. The management of DNS is also an important element of failing over from one datacenter to another.

Gem Repository Servers

Figure 4: Gem Repository Servers

Ruby applications use "gems" as a way of packaging external libraries. Typically, Rafter packages code shared between their applications into gems. Particular gems and their versions are bound as dependencies inside applications. Rafter built a gem repository cluster in both datacenters as a place to publish private gems. Figure 4 gives a representation of the gem repository servers. Rafter began with an open source gem server called Geminabox, which provides a simple user interface for developers to upload a new version of a gem to the server. However, it does not provide support for high availability and simply uses the local filesystem as its repository for storing gems. In order to build high availability around this, Rafter first created multiple gem servers in each datacenter and put them behind a load balancer. However, this did not automatically synchronize the data between all of the gem servers. Rafter then built a script that runs on each gem server and synchronizes the underlying gem repository with two neighbors (one in the local datacenter and one in the other datacenter). Under the hood, it uses unison, an open source bidirectional file synchronizer similar to rsync, to synchronize the gem repository files every minute. They also built a monitoring script to query each of the gem servers and alert on inconsistencies.

ElasticSearch

Rafter has a single 6-node ElasticSearch cluster which spans both datacenters. ElasticSearch's shard allocation feature can ensure that all shards are fully replicated to the secondary datacenter. ElasticSearch can also be taught datacenter awareness, so that it prefers nodes in the same datacenter for queries. When Rafter first deployed ElasticSearch, they had connectivity problems between the datacenters, since ElasticSearch prefers stable, low-latency links between all of its cluster nodes. There are not many users running ElasticSearch clusters over a WAN link yet, so this is somewhat new territory, but Rafter has not had any significant issues thus far. Rafter did tune several TCP kernel parameters and ElasticSearch's own timeout settings to avoid having nodes in the other datacenter constantly leave and rejoin the cluster due to timeouts.

DNS (Domain Name Servers)

Rafter maintains 2 DNS servers in each datacenter that provide local DNS service. One DNS server behaves as the master for the other 3 DNS servers. These are standard BIND DNS servers, managed with Webmin. Webmin provides a simple web-based user interface for updating and deleting DNS records, as well as more advanced settings like promoting a DNS slave to master or changing a slave's designated master. Rafter built a Ruby library which interacts with this user interface so that DNS changes can be automated. Rafter set a 60-second TTL (Time To Live) on most DNS records and a 1-second TTL on DNS records that point to database virtual IPs. The choice of times is a trade-off between quickly recognizing a true failure and declaring a failure when one has not, in fact, occurred.

Failover

A datacenter can be in 3 different states - active, inactive, and offline. Active means there are no restrictions on servers that reside in this datacenter. Inactive means that cronjobs, daemons, and backend applications are stopped and disabled. Offline means the datacenter is inactive and will not receive any new code deploys. Offline mode is typically used for maintenance.
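The datacenter state itself is simply data held on the Chef server (the failover steps below push exactly this kind of JSON document). A minimal sketch of how such a state document might be stored as a databag item and read back through Chef's Ruby API follows; the bag name, item layout, and state values are assumptions rather than Rafter's actual schema, and saving requires a configured Chef server connection.

    # Hypothetical sketch: recording and reading datacenter state on the Chef server.
    require 'chef'

    item = Chef::DataBagItem.from_hash(
      'id'    => 'dc1',
      'state' => 'inactive'            # assumed values: active, inactive, offline
    )
    item.data_bag('datacenters')       # assumed bag name
    item.save                          # pushes the JSON document to the Chef server

    # A library call such as node.datacenter.active? could then be backed by:
    state = Chef::DataBagItem.load('datacenters', 'dc1')['state']
    state == 'active'                  # => false after the save above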
Rafter makes use of the first two states when performing a datacenter failover. A controlled failover happens for
maintenance or testing purposes. The primary datacenter remains available during a controlled failover. During an uncontrolled failover, the primary datacenter is unavailable. We now present the steps that Rafter uses in both of these cases.

Controlled Failover Steps

For the purposes of these steps, assume the following key:

DC1 = Datacenter to fail over from
DC2 = Datacenter to fail over to

1. Verify that the Clustrix databases have low replication delay. First check that the replication delay is less than 60 seconds. This check exists because one time Rafter took all of their sites down for a datacenter switch and found that their database was 30 minutes behind due to a large query. They had to wait for the database to catch up before continuing the failover, which incurred additional downtime.

2. Disable alerts on the monitoring system (Scout). Scout is a monitoring system similar to Nagios that Rafter uses for monitoring servers (CPU, disk, network, URL health checks, etc.). Rafter disables alerts temporarily in order to prevent getting pages and emails during the switch.

3. Put up a "Website going down soon" banner. Rafter puts a small banner on their website telling customers the website will be going down for maintenance in 10 minutes and asking them to finish up their purchases. The deployment system does this automatically by placing a special file in a directory on each application server.

4. Mark DC1 as "inactive" in the Chef server. Using the Chef server as a database, a new JSON document is pushed to the Chef server, marking the DC1 datacenter as "inactive". Any clients using the infrastructure library will be able to see this change. At this point in the process, both DC1 and DC2 are in the inactive state.

5. Run Chef on all servers in DC1. Chef provides a command called "knife ssh" that allows one to run commands in parallel across a set of servers matching a particular search. The command

       knife ssh "roles:dc1" "sudo chef-client"

   runs Chef on all servers in DC1. Because Chef uses the infrastructure library, it will detect that the server it is running on is now in an inactive datacenter and take the appropriate steps to reconfigure it. This has the effect of removing any application-specific cronjobs and stopping/disabling any backend daemons.

6. Send a -TERM signal to any running cronjobs or backend scripts in DC1. A significant number of applications run in the backend via cron, and they are stopped by sending a TERM signal. This step is accomplished with a command similar to the following:

       knife ssh "roles:dc1" "ps hww -C ruby -o pid,user,cmd | grep app_user | awk '{print $1}' | xargs -i kill -TERM {}"

7. Send a -9 signal to any stubborn scripts that refuse to stop in DC1. Sometimes a TERM signal is not enough to make the script stop, for example, an application mistakenly catching all
exceptions (including the TERM signal). By now, most of these problem scripts in Rafter's applications have been fixed, but this step exists in case some new application appears that is not a good citizen.

8. Put up a maintenance page on the several web properties that have external customers. Rafter puts up a maintenance page with a message like "We're currently performing scheduled maintenance, we'll be back in 15 minutes". The deployment system can do this automatically by placing a special file in a directory on each application server.

9. Promote a new Redis master server in DC2.

Figure 5: Redis promotion

10. Redis, by default, does not have any automated failover, so Rafter had to build their own system for doing this. It operates as follows (Figure 5):

   a. Configure all applications to connect to the redis-master01 hostname. This resolves to a local IP alias in the datacenter.
   b. Release the IP alias from the current Redis master; this has the effect of severing any open Redis connections. Then write the current time to a heartbeat key on the master Redis server and check that this key has the same value on the slave to be promoted. This ensures the slave has caught up to the master.
   c. Promote the new Redis server to master, and reconfigure all the slaves to use the new master server.
   d. Bind the IP alias to the new Redis master server. Send a gratuitous ARP (Address Resolution Protocol) message to ensure all ARP caches are flushed.
   e. Update the redis-master01 DNS record to resolve to the new IP alias. Relying on DNS has been one area of concern with this solution since clients can cache it; however, Rafter has had no such problems. The conjecture is that this is because client connections to the old master are interrupted when the IP alias is released, which forces the Redis client library to reconnect to the new master. A 1-second TTL is set on the DNS record.

11. Promote a new Clustrix master in DC2. Because the Clustrix databases in the two datacenters are configured in a Master-Master scheme, there is no need to reconfigure the replication on a failover. These are the steps:

   a. Add the READONLY flag to the database in DC1.
   b. Check to make sure both databases are completely in sync by checking the binlog positions on each database. Check that DC2 has received all of DC1's binary logs and vice versa.
   c. Remove the READONLY flag from the Clustrix DC2 database.
   d. Update the clustrix-master DNS record to point to the VIP for Clustrix in DC2. Simply updating the DNS record is not sufficient, as connections from applications will stay open to the DC1 database. Loop through all open connections on the DC1 Clustrix using `show processlist`, and kill all open sessions using the `kill` command. This forces applications to open a new connection to the database, and they end up connecting to the database in DC2.

12. Mark DC2 as "active" in the Chef server. This follows the same process as Step 4. At this point in time, DC1 is inactive and DC2 is active.

13. Run Chef on all servers in DC2. Again, the knife ssh command is used to run Chef on all servers in DC2. Now that DC2 is in an active state, Chef will perform all the configuration changes necessary to convert each server to an active state. This essentially means installing cronjobs for all backend applications, installing and starting up application daemons, and reconfiguring application-specific firewall rules.

14. Remove maintenance pages on all customer-facing web applications. Using the deployment system, remove the maintenance page on all of the customer-facing web applications.

15. Update the Akamai CDN. Akamai sits in front of Rafter's main shopping cart application as a proxy providing CDN services. Using a provided SOAP API, Akamai is updated to send traffic to DC2. Over the course of about 5 minutes, Akamai will start sending traffic to the new datacenter.

16. Update public DNS for all applications. For applications that do not use Akamai, the public DNS records are updated to point to IP addresses in DC2 using AWS Route 53. Because of the 1-minute TTL for the public DNS, within a few minutes of these changes most customers start hitting applications in DC2. The infrastructure library assists in this task. The pseudocode for this looks something like:

       DevOps::Applications.all do |app|
         update_public_dns(app.fqdn, app.external_ip)
       end

17. Update local DNS for all applications. Update the local DNS servers so addresses of internal services point to IP addresses in DC2. Again, the library assists:

       DevOps::Applications.all do |app|
         update_local_dns(app.fqdn, app.internal_ip)
       end

18. Update escalation priorities in Rafter's monitoring system (Scout). In the monitoring system, change the priority of alerts generated from servers in DC1 to "normal" priority. This means that operators continue to receive alerts via email from servers in this datacenter, but not pages. Update the escalation priority for servers in DC2 to "urgent" priority. This means that critical alerts for servers in DC2 can open alerts in PagerDuty. Rafter has found that monitoring servers in the inactive datacenter is still important, since one does not want to find out about potential server problems on the day of a failover.
However, email-based alerts in place of pages are sufficient. This step is accomplished via Ruby code that interacts with Scout's website via the Ruby mechanize library (a minimal sketch of this style of UI automation follows the list).

19. Re-enable alerts on Rafter's monitoring system (Scout).
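A minimal sketch of mechanize-driven UI automation of the kind mentioned in Step 18 is shown below. The URLs, form fields, and priority value are assumptions for illustration; Scout's actual pages and Rafter's script are not documented in this chapter.

    # Hypothetical sketch: updating an alert escalation setting through a web UI
    # with the mechanize gem.
    require 'mechanize'

    agent = Mechanize.new
    login_page = agent.get('https://scoutapp.com/login')          # assumed URL
    login_form = login_page.forms.first
    login_form['email']    = ENV['SCOUT_EMAIL']
    login_form['password'] = ENV['SCOUT_PASSWORD']
    agent.submit(login_form)

    settings_page = agent.get('https://scoutapp.com/notification_groups/dc2/edit')  # assumed URL
    settings_form = settings_page.forms.first
    settings_form['priority'] = 'urgent'    # page via PagerDuty for the newly active datacenter
    agent.submit(settings_form)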
Uncontrolled Failover

In this situation, the currently active datacenter is completely unavailable, for example because of a power outage at the datacenter. There are fewer steps for this type of failover, since it is not possible to do a clean shutdown of the primary datacenter. In this situation, there is significant potential for data inconsistency, and there is most likely some data cleanup needed after the primary datacenter comes back online. For example, even with only a 1-second database replication delay, it is likely that DC2 will be missing some data at the time of the cutover. For the purposes of these steps, assume the following key:

DC1 = Datacenter to fail over from
DC2 = Datacenter to fail over to

1. Mark DC2 as "active" and DC1 as "inactive" in the Chef server. Publish a new JSON document to the Chef server marking DC2 as "active" and DC1 as "inactive". All applications using the infrastructure library will now see the correct state of the datacenters.

2. Promote a local DNS server to master and reconfigure the slave (if necessary). If the datacenter that went offline contains the current master, then no local DNS updates can be processed. Thus, one of the DNS slaves in DC2 must be promoted to master, and the other server should become a slave of the new master. This is done with a Ruby mechanize script using the Webmin web UI.

3. Promote a new Clustrix master in DC2. This is a subset of the steps performed for a controlled failover:
   a. Remove the READONLY flag on the database in DC2.
   b. Update the clustrix-master DNS record to point to DC2.

4. Promote a new Redis master in DC2. This is a subset of the steps performed for a controlled failover:
   a. Promote a Redis slave in DC2 to master.
   b. Bind the IP alias to the new master.
   c. Reconfigure the other Redis slave in DC2 to be a slave of the new master.
   d. Update the redis-master DNS record.

5. Update local DNS for all applications. This is the same step as in the controlled failover.

6. Update public DNS for all applications. This is the same step as in the controlled failover.

7. Update the Akamai CDN. This is the same step as in the controlled failover.

8. Run Chef on all servers in DC2. This is the same step as in the controlled failover.

9. Update escalation priorities in Rafter's monitoring system. This is the same step as in the controlled failover.

Automating the failover

When Rafter first started performing datacenter failovers, some of the steps above were done manually and there was no process tying them all together. They simply wrote down all the steps in a Word document and executed them one by one. Once they were confident in the process, they began automating it. Rafter started by building a Ruby library that uses a domain-specific language (DSL) for defining steps in the failover process. Failover steps fall into two categories: either they launch an external shell command (for example, `knife ssh` commands) or they run Ruby code (such as checking database delay). Here is an example of Step 1 in the controlled failover:

    step "Check: Ensure replication delay is low on Clustrix" do
      prereqs :clustrix_databases, :datacenters
      ruby_block do
        result = nil
        puts "Checking Slave Delay in each datacenter"
        @datacenters.each do |datacenter|
          client = Mysql2::Client.new(:host     => @clustrix_databases[:vips][datacenter],
                                      :username => "root",
                                      :password => @clustrix_databases[:password])
          row = client.query("show slave status").first
          puts "Slave delay in #{datacenter}: #{row["Seconds_Behind_Master"]}"
          if row["Seconds_Behind_Master"] > 60
            result = failed("Slave delay must be less than 60.")
          end
          client.close
        end
        result ? result : success
      end
    end

In the above step, the infrastructure library automatically runs the Ruby code in the ruby_block section during the datacenter switch. The block returns either a failure or a success using the provided failed/success helper methods. Every step in the datacenter switch process has a corresponding step defined in the DSL. The DSL also contains the concept of "prereqs", since most steps need a common set of data inputs (e.g., the datacenters being swapped, database information, and credentials). Instead of having each step gather the same data, each step declares which prerequisite data it needs. The prerequisites are defined in another domain-specific language and run at the beginning of the datacenter swap, before any steps are executed. For example, here is what the clustrix_databases prereq looks like:

    prereq "clustrix_databases" do
      if ENV['CLX_PASS']
        db_info = {:password => ENV['CLX_PASS']}
      else
        puts "Please enter the root password for the database:"
        db_info = {:password => STDIN.noecho(&:gets).chomp!}
      end
      db_info[:vips] = DevOps::Clusters.load("clustrix")[:vips]
      db_info
    end

Next, the DSL was expanded to logically group steps together into lists matching the two types of datacenter failover. For example, here is the list of steps for a controlled failover:

    list "Controlled Failover" do
      step "Check: Ensure replication delay is low"
      step "Disable Scout Notifications"
      step "Put up 'Site Going Down Soon' message"
      step "Set current datacenter to inactive"
      step "Run chef in DC to swap from"
      step "Run killer with -TERM"
      step "Run killer with -9"
      step "Put up 'Maintenance' pages"
      step "Swap redis"
      step "Swap redis2"
      step "Swap DB"
      step "Set new DC to active"
      step "Run chef in DC to swap to"
      step "Remove 'Maintenance' pages"
      step "Swap Akamai"
      step "Swap Route 53 DNS"
      step "Swap Datacenter DNS"
      step "Update Scout Notification Groups"
      step "Enable Scout Notifications"
    end

For each step defined in the list, the library automatically looks for a step with the corresponding name. After Rafter had built a framework for defining steps and grouping these steps in a specific order, they built a console-based application for running these lists interactively. Upon startup, it displays all the lists that have been defined, by name. For example:

    Pick a set of Steps to perform: <enter defaults to #1>
    1. Controlled Failover
    2. Uncontrolled Failover
    Choose: 1

In the console, the operator chooses a list to run. The UI then displays the names of all steps in the list and asks for confirmation. It also allows skipping over steps or starting from a different step if so
desired.

    The following steps will be run:
    1. Check: Ensure replication delay is low
    2. Disable Scout Notifications
    3. Put up 'Site Going Down Soon' message
    4. Set current datacenter to inactive
    5. Run chef in DC to swap from
    6. Run killer with -TERM
    7. Run killer with -9
    8. Put up 'Maintenance' pages
    9. Swap redis
    10. Swap redis2
    11. Swap DB
    12. Set new DC to active
    13. Run chef in DC to swap to
    14. Remove 'Maintenance' pages
    15. Swap Akamai
    16. Swap Route 53 DNS
    17. Swap Datacenter DNS
    18. Update Scout Notification Groups
    19. Enable Scout Notifications
    Press Enter if satisfied with order, or type in alternative order of steps, comma delimited, ranges allowed: <enter>

Next, the console application runs any prereqs for the given steps and then begins running each step, displaying any output of the step in the console. The amount of "auto pilot" is up to the operator. The operator can either run each step manually, pressing "enter" to proceed to the next step, or the list can run fully automated, automatically going to the next step once the previous
step finishes. If a step fails, the operator can choose to skip to the next step or to retry the step. The console application can be run either from a developer's machine or from a utility server that has network access to servers in both datacenters and internet access in order to reach the Chef server. All commands are run using the operator's own credentials (sudo password, ssh keys, Chef key, etc.), so even if an unauthorized user has access to the console application, they will not have the level of access required to run the steps.

Testing

As with all other processes and software, the infrastructure and failover control must be tested.

Datacenter Switching Application

Unit tests are written in RSpec for each step, the underlying DSL library, and the console application. This provides a good sanity check for obvious bugs in the code or logic. However, not everything can be fully tested, since connections to resources like databases and external commands are stubbed out in the tests. In order to help find bugs that the unit tests might not be able to catch, the entire console application runs in a dry-run/debug mode by default. This mode executes every step in the process but skips over the parts of steps that do potentially unsafe things like changing state. A helper method, "debug?", is used in the steps to check whether the step is running in debug mode. This allows Rafter to safely run through an entire datacenter switch in dry-run mode ahead of time.

Infrastructure Testing

A lot of the "heavy lifting" done during the datacenter switch is actually outside of the datacenter switching application and is performed by the infrastructure library and Chef when a server needs to be reconfigured. The infrastructure library and Chef cookbooks are constantly changing, so they require more rigorous testing. In order to verify that a change does not break existing functionality, unit tests using RSpec are run. In addition to unit tests, Rafter also built a suite of integration tests. These tests first boot up a brand new server on the staging server platform, and then use Serverspec to verify that Chef and the infrastructure library configured the server properly.

Continuous Integration Pipeline

All changes to Rafter's infrastructure and datacenter switching applications go through a continuous integration pipeline using TeamCity. Any new changes are committed to a branch named "test". TeamCity then runs syntax checking, lint, and unit tests on this test branch. If everything passes, it boots up a virtual server and runs the suite of integration tests against it. Finally, if no tests have failed, TeamCity automatically merges the changes into a "staging" branch and then automatically publishes all infrastructure changes to the staging environment. After everything is working well in staging, the changes are manually merged into a production branch, tests are run again on the production branch, and the changes are then automatically published to the production environment if no tests fail.
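As an illustration of the Serverspec style of check described above, a minimal spec might look like the following. The service, port, and path being verified are assumptions for illustration, not Rafter's actual test suite.

    # spec/app_server_spec.rb - hypothetical Serverspec checks run against a
    # freshly provisioned application server.
    require 'serverspec'
    set :backend, :exec   # run the checks directly on the target machine

    describe service('nginx') do
      it { should be_enabled }
      it { should be_running }
    end

    describe port(443) do
      it { should be_listening }
    end

    describe file('/etc/cron.d/cat-cleanup') do   # assumed cronjob installed by Chef
      it { should be_file }
    end

Run with `rspec spec/app_server_spec.rb` on the new server, or over SSH using Serverspec's ssh backend.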
Summary

Having two datacenters with the ability to fail over smoothly from one to the other involves the applications, the database systems, and the infrastructure management system. In this chapter we described how Rafter achieved this ability.

At the application level, applications had to be designed using the following principles:

1) Applications should only store state in approved data stores.

2) The data storage needs of an application must be considered at the design phase. Application changes and new features can cause significant delays in replication, which, in turn, harm the ability to fail over quickly. Usually some concessions can be made; for example, the data may not need to be replicated to both datacenters because it is not business critical, or perhaps an application can store data in a different format so that it stores less data (e.g., storing aggregate data instead of every record).

3) Backend applications must be designed to stop gracefully. Most of Rafter's backend applications are safe to kill at any point, but some applications can cause harm if killed at the wrong moment. For example, imagine an application that charges a customer and then records the charge in the database. If the application is killed after charging the customer but before recording it in the database, the customer could be charged again later. To prevent this, applications which are sensitive to being killed are built to intercept the TERM signal (using Ruby's trap callback), finish only their current iteration, and then exit gracefully. For example:

    trap("TERM") do
      $should_exit = true
    end

    orders_to_charge.each do |order|
      order.charge!
      break if $should_exit
    end

At the infrastructure level, Ruby on Rails and Chef were important tools, but they had to be made datacenter-aware through the construction of an infrastructure library that manages both the infrastructure and the deployment of the various applications to the appropriate datacenter.

At the database level, database systems that understand distribution and can be configured to support distributed data are a key ingredient, but, again, Rafter had to make special provisions to enable the systems to perform appropriately.

Even with the appropriate architecture at each level, managing failover from one datacenter to another involves a sequence of steps that can be automated but must be performed in a particular order.

Sources

http://docs.opscode.com/chef_overview.html
http://rubyonrails.org/
https://rubygems.org/
https://github.com/
http://www.clustrix.com/
http://redis.io/
http://www.couchbase.com/
http://www.elasticsearch.org/
https://scoutapp.com/
http://rspec.info/
http://serverspec.org/
http://www.jetbrains.com/teamcity/
https://github.com/resque/resque
http://sidekiq.org/
http://memcached.org/
http://www.akamai.com/