You’ve got HBase
How AOL Mail handles Big Data
Presented at HBaseCon
May 22, 2012
The AOL Mail System
Over 15 years old
70+ Million mailboxes
50+ Billion emails
A technology stack that runs the gamut
What that means…
Lots of data
Lots of moving parts
Mature system + Young software = Tough marriage
We don’t buy “commodity” hardware
Ingrained Dev/QA/Prod product lifecycle
Somewhat “version locked” to tried-and-true platforms
Expect service outages to be quickly mitigated by our NOC w/out waiting for an on-call
So where does HBase fit?
It’s a component, not the foundation
Currently used in two places
Being evaluated for more
It will remain a tool in our diverse Big Data arsenal
An Activity Profiler
An “Activity Profiler”
Watches for particular behaviors
Designed and built in 6/2010
Originally “vanilla” Hadoop 0.20.2 + HBase 0.90.2
1.4+ Million Events/min
60x 24TB (raw) DataNodes w/ local RegionServers
15x application hosts
Is an internal-only tool
Used by automated anti-abuse systems
Leveraged by data analysts for ad hoc queries/MapRed
An “Activity Profiler”
Why the “Event Catcher” layer?
Has to “speak the language” of our existing systems
Easy to plug an HBase translator in to existing data feeds
Hard to modify the infrastructure to speak HBase
Flume was too young at the time
Why batch load via MapRed?
Real time is not currently a requirement
Allows filtering at different points
Allows us to “trigger” events
Designed before coprocessors
Early data integrity issues necessitated “replaying”
Missing append support early on
Holes in the Meta table
Long splits and GC pauses caused client timeouts
Can sample data into a “sandbox” for job development
Makes Pig, Hive, and other MapRed easy and stable
We keep the raw data around as well
HBase and MapRed can live in harmony
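To make the "filtering at different points" idea concrete, here is a minimal sketch (not AOL's actual code; the field names, rowkey scheme, and column layout are invented for illustration) of filtering events once at parse time and again when they are mapped to HBase-style rows:

```python
def parse_event(line):
    """Parse one tab-separated log line written by the event catcher."""
    ts, account, event_type, payload = line.rstrip("\n").split("\t")
    return {"ts": int(ts), "account": account,
            "type": event_type, "payload": payload}

def keep(event, wanted_types):
    """First filter point: drop event types the profiler does not track."""
    return event["type"] in wanted_types

def to_row(event):
    """Second stage: map an event to a (rowkey, column, value) triple.
    Keying on account keeps all of one account's activity together."""
    rowkey = "%s:%013d" % (event["account"], event["ts"])
    return (rowkey, "e:" + event["type"], event["payload"])

lines = [
    "1337650000000\tuser1\tlogin\tok",
    "1337650000500\tuser2\tnoise\tjunk",
    "1337650001000\tuser1\tsend\tmsg",
]
rows = []
for line in lines:
    ev = parse_event(line)
    if keep(ev, {"login", "send"}):   # the "noise" event never reaches HBase
        rows.append(to_row(ev))
```

The same parse/filter functions can run against the sandbox sample or the raw data replay, which is part of why the batch path stayed useful.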
Bigger than “average” hardware
Proper system tuning is essential
Good information on tuning Hadoop is prolific, but…
XFS > EXT
JBOD > RAID
As far as HBase is concerned…
Just go buy Lars’ book
Careful job development, optimization is key!
Contact History API
Contact History API
Services a member-facing API
Designed and built in 10/2010
Modeled after the previous application
Built by a different Engineering team
Used to solve a very different problem
3+ Million Inserts/min during MapRed
20x 24TB (raw) DataNodes w/ local RegionServers
14x application hosts
Leverages Memcached to reduce query load on HBase
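A minimal sketch of that read-through pattern, with plain dicts standing in for memcached and the HBase table (all names and data here are illustrative, not the real API):

```python
cache = {}                                   # stands in for memcached
table = {"user1": ["alice@example.com"]}     # stands in for the HBase table
stats = {"hbase_reads": 0}

def get_contact_history(account):
    """Check memcached first; hit HBase only on a miss, then populate
    the cache so the next caller is served from memory."""
    if account in cache:
        return cache[account]
    stats["hbase_reads"] += 1                # the expensive path
    value = table.get(account, [])
    cache[account] = value
    return value
```

Repeated lookups for the same account cost one HBase read; everything after that is served out of the cache.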
Contact History API
Where we go from here
Amusing mistakes to learn from
Batch inserts via MapRed result in fast, symmetrical key space growth
Attempting to split every region at the same time is a bad idea
Turning off region splitting and using a custom “rolling region splitter” is a good idea
Take time and load into consideration when selecting regions to split
Backups, backups, backups!
You can never have too many
Large, non-splittable regions tell you things
Our key space maps to accounts
Excessively large keys equal excessively “active” accounts
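A toy version of the "rolling region splitter" selection logic described above. The scoring rule and thresholds are invented for illustration; the point is splitting a few regions per pass, ranked by size, while skipping regions that are currently hot:

```python
def pick_regions_to_split(regions, max_splits_per_pass=2,
                          size_threshold_gb=4, load_threshold=0.8):
    """regions: dicts with 'name', 'size_gb', and 'load' (0..1).
    Split only oversized regions, skip hot ones, and cap how many
    split per pass so the whole key space never splits at once."""
    candidates = [r for r in regions
                  if r["size_gb"] > size_threshold_gb
                  and r["load"] < load_threshold]      # leave hot regions alone
    candidates.sort(key=lambda r: r["size_gb"], reverse=True)
    return [r["name"] for r in candidates[:max_splits_per_pass]]
```

With automatic splitting turned off, a cron-style pass over this selection rolls the splits out gradually instead of all at once.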
Introduce myself: I am Chris Niemira, a Systems Administrator with AOL. I run a number of Hadoop and HBase clusters, along with numerous other components of the AOL Mail system. I spend my days doing work that ranges from system patches, code installs and troubleshooting, to capacity planning, performance and bottleneck analysis, and kernel tuning. I do a little engineering, a little design work, an on-call rotation, and every once in a while I get to play with Hadoop/HBase.
The AOL Mail System has been around for a long time, and went through a major re-architecture between 2010 and 2011. It's not a 15-year-old code base; we evolve it constantly. We service over 70 million mailboxes in the AOL Mail environment today. That includes supporting our paying members in addition to free accounts. Of course, member experience is our #1 priority. We have all kinds of tools in our proverbial utility belt, as we believe in using the right tool for the right job.
It means we’re reasonably large. But we’ve also been operating “at scale” for a long time now. While we have been doing “Big Data” for a lot of years now, we got to our current size by operating a certain way: Rigid quality and change controls, lots of documentation, emphasis on uptime. As we have shifted toward being more agile, we have had to be careful with unproven technologies. HBase, for all the buzz, is still pretty young and error-prone. Some of the realities for dealing with a production Hadoop/HBase system would seemingly require a departure from our traditional mentality. Like everyone, we require stability and robustness of our production applications, but our way of getting there has had to change. Above all, however, we must still take care of our customers, so it’s a balancing act for us.
So HBase is one of the tools we've added to the kit in the last few years, and it's still proving itself. We've got two applications running and we've identified a few other places where it's a good candidate. This isn't to say that we are not using it for important things, but it's not at the core of our system. We've managed to build a relatively stable platform over time. There's a lot of scripted recovery and a lot of proactive monitoring in our environment, and for the most part, when there are problems, they are mitigated or resolved without even the involvement of an admin.
AOL Mail first started looking into Hadoop and HBase back in mid-2010. Other business units in our company had been working with Hadoop for a while before then, and a little intra-company consulting convinced us to give HBase a try. This system is one component of our anti-abuse strategy. I can't reveal exactly what it does, but I can tell you a bit about how the HBase stuff happens. In addition to the 60-node cluster and the application servers, there's the ancillary junk, which includes NameNodes (2x), HMasters (2x), and Zookeepers (3x). The app hosts and Zookeepers, which are currently physical, are being switched to virtual devices in our internal cloud.
This is what the application looks like. The "Service Layer" comprises various components within the AOL Mail system. They speak their own protocols and send messages to an "Event Catcher," which decodes the stream and writes a log to local disk. That log is imported into Hadoop (and can optionally be sampled to a development sandbox at the same time) and then further cooked via MapRed, which ultimately outputs rows into HBase and can send triggers to external applications. One thing we can do at this point (not illustrated) is populate a memcache that client apps may use to reduce load on some HBase queries.
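A rough sketch of the Event Catcher's role, decoding a made-up legacy wire format into plain log lines and optionally sampling a copy into the development sandbox (the frame format, field names, and sampling scheme are all invented for the example):

```python
import random

def decode(frame):
    """Pretend legacy wire format: 'MAILEVT|<account>|<type>|<detail>'."""
    tag, account, etype, detail = frame.split("|")
    assert tag == "MAILEVT"
    return account, etype, detail

def catch(frame, log, sandbox, sample_rate=0.1, rng=random.random):
    """Speak the legacy protocol on one side; write plain tab-separated
    log lines on the other. Sometimes copy a line to the sandbox so
    jobs can be developed against a small, fresh sample."""
    account, etype, detail = decode(frame)
    line = "\t".join((account, etype, detail))
    log.append(line)
    if rng() < sample_rate:
        sandbox.append(line)
    return line
```

The key property is that downstream Hadoop only ever sees the neutral log format, so nothing in the mail infrastructure had to learn to speak HBase.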
The real answer is that when we first started, we couldn't make streaming a million and a half rows a minute work with the HBase we had two years ago. At the time, it was easier for us to build the batch loader, which has proven to have a few interesting advantages. Our next-generation model will rely on HBase itself being more stable, and will heavily leverage coprocessors to do a lot of what we're doing now with MapReduce.
A big obstacle for us is getting MapReduce and HBase to play nicely together. From what I've seen, bigger hardware is starting to become more popular for running HBase, and we believe it's essential. We've floated between an 8 – 16 GB heap for the RegionServer; for this application, I believe we're currently using 16. Getting GC tuning and the IPC timeouts in HBase/Zookeeper correct is critically important. System tuning is also very important. Depending on which flavor of Linux you're running, the stock configuration may be completely inappropriate for the needs of an HBase/Hadoop complex. In particular, look at the kernel's IO scheduler and VM settings.
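As a rough illustration of the knobs meant here (starting points only; the right values depend on your kernel, disks, and workload, so test rather than copy):

```shell
# Swap the default cfq elevator for deadline on a DataNode data disk
# (repeat per device; device name is an example):
echo deadline > /sys/block/sdb/queue/scheduler

# Keep the VM from swapping out the RegionServer heap and from
# building up large dirty-page backlogs that stall under heavy IO:
sysctl -w vm.swappiness=0
sysctl -w vm.dirty_background_ratio=5
sysctl -w vm.dirty_ratio=10
```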
This application was built a short while after we started our trial-by-fire with HBase on the previous application. It was a different development team with input from the engineers working on the previously discussed application. This application has the same “event catcher” layer for the same reasons, but it has always written directly to HBase. We import data into a “raw” table and then process that table with MapReduce writing the output into a “cooked” table. There’s a much lower number of events here, but it spikes up significantly during the MapReduce phase. It’s exactly the same class of hardware with the same ancillary junk as the previous app. Most of the query load is actually farmed out of memcache.
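A minimal sketch of that raw-to-cooked step, with invented table contents: the MapReduce pass collapses per-event "raw" rows into one deduplicated contact history per account (the real schema is not shown here):

```python
from collections import defaultdict

def cook(raw_rows):
    """raw_rows: iterable of (account, contact) pairs read from the
    'raw' table; returns 'cooked' rows mapping each account to its
    contact history in first-seen order, duplicates collapsed."""
    cooked = defaultdict(list)
    for account, contact in raw_rows:
        if contact not in cooked[account]:
            cooked[account].append(contact)
    return dict(cooked)
```

Because the cooked table is rebuilt from raw, a bad cooking pass can always be re-run, which is the same replay safety net the Activity Profiler relies on.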
Yes, this is a relatively straightforward design.
Exploding tables might be a better name for this, since it's an across-the-board sort of thing. Backups, of course, are obvious. We've run into three catastrophic data loss events, once each on three different clusters. The first was during a burn-in phase for the Contact History application I described earlier. At that time, the data it had accumulated over the week or so it had been running wasn't considered essential, so we were able to truncate and move along. Another time, on a separate plain Hadoop cluster, an unintentionally malicious user actually managed to delete my backups and corrupt the NameNode's edit log. Luckily that data was restorable from another source. The last time was with the Activity Profiler application. Basically, having data backups saved the day.
This is our working model for a next-generation HBase system. It is currently being prototyped with the cooperation of our Engineering and Operations teams. The key design concept is to allow for a great deal of flexibility and re-use, and it centers around the idea of installing a fairly dynamic rules-engine at both the event collection and event storage layers. Hopefully we will be presenting it soon.