Case Study - How Rackspace Queries Terabytes of Data
How Rackspace Queries
Terabytes of Log Data
Using MapReduce and Hadoop
Case study by Schubert Zhang, 2009-04-30
• Rackspace has more than 50K devices and 7 data centers.
• The mail system and logging servers are currently in 3 of
the Rackspace data centers.
• The system stores over 800 million objects (an object = a
user event such as receiving an email or logging into
IMAP) within Solr and 9.6 billion (records?) within
Hadoop, which equals 6.3 TB compressed.
• Several hundred gigabytes of email log data are
generated each day (apparently about 140 GB per day after cleanup).
Background on Mailtrust
• Email hosting company
• Founded in 1999, merged with Rackspace in 2007,
previous name: Webmail.us
• 80K business customers, 700K mailboxes.
• 2 hosted mail products: Noteworthy, MS Exchange
• The Noteworthy System:
– Homegrown, Linux based, POP3, IMAP, webmail, RSS feeds,
shared calendaring, Outlook sync, Blackberry sync.
– ~600 servers, commodity hardware, designed to work around frequent failures.
• The MS Exchange System:
– MAPI, POP, IMAP, OWA, Blackberry, Goodmail, ActiveSync.
– ~100 servers, higher-end hardware, SAN & DAS storage.
• Hundreds of gigabytes of new data each day streaming in from over 600 hyperactive servers.
• Log processing system evolution:
– (1) Flat text files stored on each machine.
• Had to be manually searched by engineers logging into each individual machine.
– (2) Relational database solution that just couldn't compete. MySQL.
• Inserts quickly became the bottleneck.
• A lot of index churn.
• Data was then broken into Merge Tables based on time so index updates weren't a problem.
• Load and operational problems.
– (3) Hadoop based solution that works wisely and has virtually unlimited scalability potential.
• Lucene and Solr.
• The now-familiar problem: lots and lots of data streaming in.
– Where do you store all that data?
– How do you do anything useful with it?
– How do you retrieve the data you want from that sea of data?
• Examine mail logs in order to troubleshoot problems for our customers.
• The query/search should be fast and accurate.
The New System
• The advantage of their new system is that they can now
look at their data in any way they want:
– Nightly MapReduce jobs collect statistics about their mail system,
such as spam counts by domain, bytes transferred and number of
logins (a sketch of such a job follows below).
– When they wanted to find out which part of the world their
customers logged in from, a quick MapReduce job was created
and they had the answer within a few hours. Not really possible
in your typical ETL system.
• "Now whenever we think of complex question about our
customers’ usage patterns, we can pull the answer from
our logs within hours via MapReduce. This is powerful
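
• To make the nightly statistics jobs above concrete, here is a minimal sketch of a
Hadoop MapReduce job that counts spam events per recipient domain. The log layout,
field positions and the "SPAM" marker are assumptions for illustration, not Rackspace's
actual schema; the modern Hadoop Java API is used.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SpamByDomain {
      // Map: emit (recipient domain, 1) for every log line flagged as spam.
      public static class SpamMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
            throws IOException, InterruptedException {
          String[] f = value.toString().split("\t");       // hypothetical tab-separated layout
          if (f.length > 3 && "SPAM".equals(f[2])) {        // hypothetical spam flag in field 2
            String domain = f[3].substring(f[3].indexOf('@') + 1);  // recipient in field 3
            ctx.write(new Text(domain), ONE);
          }
        }
      }

      // Reduce: sum the counts for each domain.
      public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text domain, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long total = 0;
          for (LongWritable c : counts) total += c.get();
          ctx.write(domain, new LongWritable(total));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "spam-by-domain");
        job.setJarByClass(SpamByDomain.class);
        job.setMapperClass(SpamMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS directory of raw logs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // where per-domain counts go
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }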
• Hadoop MapReduce
• Hadoop Distributed File System (HDFS)
• Raw logs get streamed from hundreds of mail servers to
the Hadoop Distributed File System (”HDFS”) in real time.
• MapReduce jobs are scheduled to run to index the new
data using Apache Lucene and Solr (see the indexing sketch below).
• Once the indexes have been built, they are compressed
and stored away in HDFS.
• Each Hadoop datanode runs a Tomcat servlet container,
which hosts a number of Solr instances that pull and
merge the new indexes, and provide really fast search
results to our support team.
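
• To illustrate the indexing step, the following is a minimal sketch (not Rackspace's
code) of how a reduce task could build a Lucene index on local disk before it is
compressed into HDFS. Field names, paths and the use of a recent Lucene API are
assumptions.

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    public class LogIndexer {
      public static void main(String[] args) throws Exception {
        // Build the index on local disk; the finished directory is later
        // compressed and copied into HDFS for the Solr nodes to pull.
        try (IndexWriter writer = new IndexWriter(
                 FSDirectory.open(Paths.get("/tmp/index-part-00000")),
                 new IndexWriterConfig(new StandardAnalyzer()))) {
          // In the real job this would loop over the reducer's input records.
          Document doc = new Document();
          doc.add(new StringField("sender", "alice@example.org", Field.Store.YES));
          doc.add(new StringField("recipient", "bob@example.com", Field.Store.YES));
          doc.add(new TextField("line", "raw mail log line goes here", Field.Store.YES));
          writer.addDocument(doc);
          writer.commit();
        }
      }
    }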
The System Evolution
• Logs were stored in flat text files on the local disk of
each mail server and were kept for 14 days.
• Our support techs did not have login access to the
servers, so in order to search the logs they would have
to escalate a ticket to our engineers. The engineers
would then have to ssh into each mail server and grep the logs.
• Problems: Once we grew much past a dozen servers,
this manual process of logging into each server became
too time-consuming for our engineers.
• Sped up the search process by writing a script that would search
multiple servers via one command run from a centralized server.
• Still grep, just run remotely from the central server (see the sketch below).
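
• A rough sketch of the idea, assuming ssh key access from the central box; host names
and the log path are hypothetical:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.List;

    public class RemoteGrep {
      public static void main(String[] args) throws Exception {
        String pattern = args[0];                               // e.g. a customer's email address
        List<String> hosts = List.of("mail01", "mail02", "mail03");
        for (String host : hosts) {
          // Run grep remotely over ssh and stream the matches back.
          Process p = new ProcessBuilder("ssh", host, "grep", pattern, "/var/log/maillog")
              .redirectErrorStream(true).start();
          try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
              System.out.println(host + ": " + line);
            }
          }
          p.waitFor();
        }
      }
    }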
• Problems: The support techs still had to escalate a ticket to the
engineers in order to perform a search. As the number of customers
and servers increased, this began to take too much of our
engineers' scarce time. Also, storing and searching the logs on a
live server was negatively affecting the performance of the servers.
To make matters worse, the engineering team had grown and we
started running into the problem where two engineers would perform
a search at the same time, which really slowed things down.
• A web-based tool was built so the support techs could search the logs themselves.
• It allowed searching by the sender or recipient's email address, domain name or IP address.
• All of these were indexed fields in a MySQL database. The centralized log server
received the raw logs and inserted them into the database as they arrived.
• Each day's logs were stored in a separate table, so that we could clean up old data by
simply dropping and recreating MySQL tables (a schema sketch follows below).
• Log data was only kept for 3 days in order to keep the MySQL database down to a
manageable size.
• Wildcard text searches (i.e. MySQL "LIKE" statements) were not allowed because the
data set was very large and these queries would be horribly slow.
• Problems: We quickly realized that we had a bottleneck with the MySQL inserts. As
the tables grew, indexing each entry as it was inserted became slow. Within the first
hours of testing, the inserts began slowing and could not keep up with the rate at
which data was received. Version 2.0 of the logging system was never used in production.
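
• A minimal sketch of the per-day table idea described above, via JDBC; the table layout
and column names are assumptions based on the searchable fields listed (sender,
recipient, domain, IP):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class DailyLogTables {
      public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://loghost/maillogs", "loguser", "secret");
             Statement st = db.createStatement()) {
          // One table per day; searches only hit indexed columns (no LIKE scans allowed).
          st.executeUpdate(
              "CREATE TABLE IF NOT EXISTS log_20090430 ("
            + "  ts DATETIME, sender VARCHAR(255), recipient VARCHAR(255),"
            + "  domain VARCHAR(255), ip VARCHAR(45), line TEXT,"
            + "  INDEX(sender), INDEX(recipient), INDEX(domain), INDEX(ip))");
          // Expiring old data is just dropping a whole day's table.
          st.executeUpdate("DROP TABLE IF EXISTS log_20090427");
        }
      }
    }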
• Fixed the MySQL INSERT bottleneck by queuing up the log entries
in local text files on the centralized log server and periodically bulk
loading them into the database. As syslog-ng received logs on its 6
ports, the data would be streamed to 6 separate text files. Every 10
minutes a script would rotate those text files and execute a MySQL
LOAD to load the data into the database. This was magnitudes
faster than inserting the log data one record at a time.
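
• A sketch of the rotate-then-bulk-load step, assuming the MySQL server runs on the same
centralized log host; the file paths and table name are hypothetical:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class BulkLoader {
      public static void main(String[] args) throws Exception {
        Path queued = Paths.get("/var/spool/maillog/queue.txt");
        Path batch = Paths.get("/var/spool/maillog/batch-" + System.currentTimeMillis() + ".txt");
        Files.move(queued, batch);   // rotate: syslog-ng keeps writing to a fresh queue file
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://loghost/maillogs", "loguser", "secret");
             Statement st = db.createStatement()) {
          // One bulk LOAD is far faster than per-record INSERTs.
          st.executeUpdate("LOAD DATA INFILE '" + batch + "' INTO TABLE maillog"
              + " FIELDS TERMINATED BY '\\t'");
        }
        Files.delete(batch);
      }
    }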
• Problems: The LOADs would get progressively slower as the
database grew because MySQL indexing performance decreases as
the table you are inserting into gets larger. This version was fast
enough to be released into production, but we knew the system
would not scale too far without additional work.
• Introduced Merge Tables in order to speed up loading the log data into the database.
• Every 10 minutes our script would create a new database table and then load the text
logs into the empty table.
• After the data was loaded, the script would modify a set of Merge Tables that
combined all of the 10-minute tables together (sketched below).
• The web search tool was modified to allow searching within the different time ranges.
Corresponding Merge Tables existed for each of those time ranges, and were
modified every 10 minutes as new tables were created.
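
• A sketch of the Merge Table rollover, assuming MyISAM slice tables and one MERGE table
per search time range; all table names are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MergeTableRollover {
      public static void main(String[] args) throws Exception {
        String slice = "log_20090430_1210";   // new 10-minute slice table
        try (Connection db = DriverManager.getConnection(
                 "jdbc:mysql://loghost/maillogs", "loguser", "secret");
             Statement st = db.createStatement()) {
          // Create the empty slice (same MyISAM layout as a template table) and bulk load it.
          st.executeUpdate("CREATE TABLE " + slice + " LIKE log_template");
          st.executeUpdate("LOAD DATA INFILE '/var/spool/maillog/batch.txt' INTO TABLE " + slice);
          // Re-point the MERGE table the search tool queries for the "last 24 hours" range.
          // In practice the UNION list would name every slice inside that range.
          st.executeUpdate("ALTER TABLE log_last_24h UNION=(log_20090430_1200, " + slice + ")");
        }
      }
    }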
• Problems: The database LOAD operations would take 2-3 minutes to run. The server
was now always under a heavy CPU and disk I/O load.
• Searches were being performed more frequently and were becoming slow. We
started to see some strange problems such as random errors while trying to create
new tables or modify the Merge Tables. These errors progressively became more
frequent, resulting in missing log data. The support team began to lose confidence in
the system's accuracy.
• The logging system had no redundancy.
• We needed a new solution that would be fast, reliable and could scale indefinitely
with our growth. We needed something truly scalable.
• Avoid limiting our abilities to build new features down the road.
• For example, we wanted to build a tool that would allow
our customers to search their logs directly.
• It scales out its workload horizontally by adding servers
and distributing the data and MapReduce jobs amongst the servers.
• In about 3 months we built a brand-new log processing
system using Hadoop, Lucene and Solr.
• Put the log search tool in the hands of our customers.
Stu Hood’s Detailed Comments
• The loading of data is streaming, but the indexing is not. We write to a file in Hadoop until it
reaches a size below the block size, or until it times out, and then we close and move it to where it
will be processed.
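
• One way to implement the close-on-size-or-timeout behaviour described above;
thresholds, paths and the Hadoop FileSystem API usage are illustrative assumptions, not
Rackspace's code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RollingHdfsWriter {
      private static final long MAX_BYTES = 60L * 1024 * 1024;  // stay below a 64 MB block
      private static final long MAX_AGE_MS = 10 * 60 * 1000;    // or close after 10 minutes

      private final FileSystem fs;
      private FSDataOutputStream out;
      private Path current;
      private long opened;

      public RollingHdfsWriter(Configuration conf) throws Exception {
        this.fs = FileSystem.get(conf);
        roll();
      }

      public synchronized void write(String line) throws Exception {
        out.writeBytes(line + "\n");
        if (out.getPos() >= MAX_BYTES || System.currentTimeMillis() - opened >= MAX_AGE_MS) {
          roll();
        }
      }

      private void roll() throws Exception {
        if (out != null) {
          out.close();
          // Hand the finished file over to the processing/indexing jobs.
          fs.rename(current, new Path("/logs/ready/" + current.getName()));
        }
        current = new Path("/logs/open/mail-" + System.currentTimeMillis() + ".log");
        out = fs.create(current);
        opened = System.currentTimeMillis();
      }
    }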
• Our processing jobs run every 10 minutes or so, meaning that the logs become available for
Customer Care after about 15. We’ve executed around 150K jobs on this cluster with 3 restarts.
• We create the indexes on local disk in our reducer, and compress them into HDFS after they are complete.
• When we pull the index to make it available for search, we decompress it to local disk and merge
it using the Lucene IndexWriter.addIndexes method before calling /commit on the Solr instance.
The Nutch project created an IndexReader that can do read-only access on HDFS, but for speed
reasons, we decided not to take that approach.
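
• A sketch of the pull-decompress-merge-commit sequence described above; the paths, the
Solr URL and the SolrJ client are assumptions (the comment only says /commit is called
on the Solr instance), and the decompression step is omitted:

    import java.nio.file.Paths;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class IndexMerger {
      public static void main(String[] args) throws Exception {
        // 1. Copy the new index out of HDFS to local disk (decompression omitted here).
        FileSystem fs = FileSystem.get(new Configuration());
        fs.copyToLocalFile(new Path("/indexes/new/part-00000"),
                           new Path("/var/solr/incoming/part-00000"));

        // 2. Merge it into the index that the local Solr core serves.
        try (IndexWriter writer = new IndexWriter(
                 FSDirectory.open(Paths.get("/var/solr/data/index")),
                 new IndexWriterConfig(new StandardAnalyzer()))) {
          writer.addIndexes(FSDirectory.open(Paths.get("/var/solr/incoming/part-00000")));
          writer.commit();
        }

        // 3. Ask Solr to reopen its searcher so the merged data becomes visible.
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/logs").build()) {
          solr.commit();
        }
      }
    }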
• Since we are indexing to local disk, we use an embedded SolrCore, in the same JVM as the reduce task.
• We have 10 Hadoop data nodes, with 3.5 TB hard drives each (= 35 TB raw in total).
• We are currently indexing an average of 140GBytes per day.
• The merged indexes are not replicated at all… only one Solr node has a copy of each index, so
failover involves a brief downtime for queries. If we lose a node, other nodes (consistent hashing)
become responsible and merge the indexes from the copies we always have in Hadoop.
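
• The comments do not show Rackspace's implementation; the following is a generic sketch
of the consistent-hashing idea, where losing a node makes the next node on the ring
responsible for its indexes:

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class IndexRing {
      private final SortedMap<Integer, String> ring = new TreeMap<>();

      public void addNode(String node) {
        for (int i = 0; i < 100; i++) {            // virtual nodes smooth the distribution
          ring.put((node + "#" + i).hashCode(), node);
        }
      }

      public void removeNode(String node) {        // e.g. on failure detection
        for (int i = 0; i < 100; i++) {
          ring.remove((node + "#" + i).hashCode());
        }
      }

      // The node responsible for a given index; after a failure, lookups naturally move
      // to the next node on the ring, which re-merges the index from the copy in Hadoop.
      public String nodeFor(String indexName) {
        if (ring.isEmpty()) return null;
        SortedMap<Integer, String> tail = ring.tailMap(indexName.hashCode());
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
      }
    }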
• Creating reports and running ad-hoc queries.
• There are more MapReduce jobs they want to run.
References
• "How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data"
• "MapReduce at Rackspace"