With multiple clusters of 1,000+ nodes replicated across multiple data centers, Flurry has learned many operational lessons over the years. In this talk, you'll explore the challenges of maintaining and scaling Flurry's clusters, how we monitor them, and how we diagnose and address potential problems.
2. ● 2008 - Flurry Analytics for Mobile Apps
○ Sharded MySQL, or
○ HBase
● Launched on 0.18.1 with a 3 node cluster
● Great Community
● Now running 0.98.12 (+patches)
● 2 data centers with 3 clusters each
● Bidirectional replication between all
History
3. How we use HBase
[Diagram: Flurry SDK -> processing pipeline -> HBase, serving analytics and mobile advertising]
In each datacenter we have 3 HBase clusters.
Our Clusters
Each machine: 128GB RAM, 4 drives (4TB each), 10GigE, 2 CPUs x 6 cores x 2 HT = 24 processors
Each machine runs: RegionServer, NodeManager, DataNode
● 1400 nodes: 1 table, 60k regions, 1.2PB (LZO compressed); ingestion pipeline, MapReduce jobs, random reads
● 800 nodes: 37 tables, 115k regions, 400TB (LZO compressed); ingestion pipeline, MapReduce jobs, random reads
● 60 nodes: 1 table, 10k regions, 2TB (LZO compressed); ingestion pipeline, low-latency random reads (99% <= 1ms, max throughput 1MM rps)
5. 1. Start replication to the new datacenter
2. Backfill using CopyTable
Data Migration - Attempt #1
[Diagram: replication from the old DC to the new DC, plus a CopyTable MR job backfilling existing data]
6. Pros
● Easy from an operational standpoint
● One MapReduce job per table does the job
● No extra copies of the data to keep around
Cons
● Job Failure == Starting Over
● Shipping uncompressed data over the wire
● Slow
Data Migration - Attempt #1
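The CopyTable backfill in Attempt #1 can be sketched with the stock tool; the table name and the destination ZooKeeper quorum below are placeholders, not Flurry's real ones:

```shell
# Backfill one table into the new DC with the stock CopyTable MR job.
# --peer.adr points at the destination cluster's ZK quorum (placeholder).
hbase org.apache.hadoop.hbase.mapreduce.CopyTable \
  --peer.adr=newdc-zk1,newdc-zk2,newdc-zk3:2181:/hbase \
  MyTable
```

If the job dies partway through there is no checkpoint, so it has to be rerun from the beginning.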
7. 1. Start replication to the new datacenter
2. Snapshot the table
Data Migration - Attempt #2
[Diagram: replication from the old DC to the new DC, plus a snapshot taken on the old DC]
8. 3. Export the snapshot
4. Bulk load into the destination cluster
Data Migration - Attempt #2
[Diagram: the ExportSnapshot MR job copies the snapshot to the new DC's HDFS; the exported snapshot is then bulk loaded while replication continues]
9. Pros
● Shipping compressed data over the wire
● With a small code change, a failed export job can resume where it left off
● Much faster than CopyTable if your compression rates are good
Cons
● While snapshots exist, compactions keep the old HFiles around
● Potentially storing 2x the original table on disk
● More operational steps than CopyTable
● Possibility of resurrecting deleted data (snapshot data -> delete data -> major compact on the destination -> import snapshot)
Data Migration - Attempt #2
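The snapshot/export/bulk-load flow above can be sketched as follows. Hostnames, paths, and table names are placeholders, and the final step shows the stock LoadIncrementalHFiles tool rather than Flurry's patched bulk load that reads an exported snapshot directly:

```shell
# 2. Snapshot the table on the source cluster
echo "snapshot 'MyTable', 'MyTable-migration'" | hbase shell

# 3. Ship the snapshot (already-compressed HFiles) to the new DC's HDFS
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot MyTable-migration \
  -copy-to hdfs://newdc-nn.example.com:8020/hbase \
  -mappers 24

# 4. Bulk load the HFiles into the destination table (stock tool shown;
#    Flurry used a patched bulk load that consumes an exported snapshot)
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  hdfs:///staging/MyTable-hfiles MyTable
```

Because ExportSnapshot copies raw HFiles, data stays LZO-compressed on the wire, which is where the speed win over CopyTable comes from.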
10. 1. Start replication to the new datacenter
2. Take partial snapshots of the table (HBASE-13031 - Ability to snapshot based on a key range)
Data Migration - Attempt #3
[Diagram: replication from the old DC to the new DC, plus multiple key-range snapshots taken on the old DC]
11. 3. Export the snapshots
4. Bulk load into the destination cluster
Data Migration - Attempt #3
[Diagram: multiple ExportSnapshot MR jobs ship the key-range snapshots to the new DC's HDFS; multiple bulk load runs import them while replication continues]
12. Pros
● Same as the previous attempt
● If you have large tables and limited DFS space, you can snapshot a key range, limiting the amount of duplicate storage at any one time
Cons
● Adds even more operational overhead
● Still the possibility of resurrecting deleted data
Data Migration - Attempt #3
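Attempt #3's rinse-and-repeat loop might look like the sketch below. Key-range snapshots come from the HBASE-13031 patch and are not in stock 0.98, so the range-limited snapshot step is illustrative; the split keys, table name, and hostname are placeholders:

```shell
# Ship a large table in key-range batches to bound duplicate storage on DFS.
# NOTE: range-limited snapshots require the HBASE-13031 patch (illustrative).
for range in a-g g-n n-t t-z; do
  snap="MyTable-part-$range"
  echo "snapshot 'MyTable', '$snap'" | hbase shell   # patched: limited to this key range
  hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -snapshot "$snap" -copy-to hdfs://newdc-nn.example.com:8020/hbase
  # ...bulk load "$snap" on the destination cluster...
  echo "delete_snapshot '$snap'" | hbase shell       # free source-side DFS space
done
```

Deleting each partial snapshot before taking the next one is what keeps the duplicate-HFile footprint to a fraction of the table instead of 2x.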
13. No downtime: [Hadoop 1.x, HBase 0.94.x] => [Hadoop 2.x, HBase 0.98.x]
Issues we had to iron out
● How to Migrate Data?
● HBase 0.94.x (Writable) vs. HBase 0.98.x (protobufs)
● Hadoop 1.x <-> Hadoop 2.x (can't push data; must use HFtp/WebHDFS)
● Snapshots are not compatible between HBase 0.94.x and 0.98.x
● Client code compatibility
● Must be compatible with both Hadoop versions and HBase versions for some time
● Migrating our HBase jobs from MapReduce v1 -> YARN
● We had a few patches to Hadoop that protected HBase which no longer applied:
● max_concurrent_map_tasks
● max_tasks_input_split
Upgrading our cluster
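Since Hadoop 1.x and 2.x RPC are incompatible, data had to move over the version-agnostic HTTP filesystems (HFtp/WebHDFS). A distcp pulling from the old cluster might look like this; hostnames and paths are placeholders:

```shell
# Run on the Hadoop 2.x (destination) side: cross-version copies can't use
# hdfs:// RPC, so read from the 1.x cluster via webhdfs (or read-only hftp).
hadoop distcp \
  webhdfs://old-nn.example.com:50070/hbase/archive/MyTable-migration \
  hdfs://new-nn.example.com:8020/staging/MyTable-migration
```

This works because the HFile format itself did not change between the versions, so the copied files can be bulk loaded as-is.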
14. The Requirement
Deploy a single code base to either 0.94 or 0.98 clusters
Good News: Most of the API calls are identical, so everything can
resolve properly at runtime!
Bad News: … most, not all.
Migrating Client Code from 0.94 to 0.98
15. What we did
Separated out our HBase client code from the rest of the project
Forked that library to have separate 94 and 98 versions
Changed our build process to include either version depending on
which cluster we’re building for
Migrating Client Code from 0.94 to 0.98
16. Serialization changed significantly (Hadoop Writable -> protobufs)
Input value types changed (KeyValue -> Cell)
Solution: added an abstract base class to handle these differences
0.94 to 0.98 - Filters
[Side-by-side: 0.94 Filter vs. 0.98 Filter]
17. Instantiation changed too - now each Filter needs its own static factory
method, which is found via reflection in the RegionServer code
Adding this method causes no backwards compatibility issues!
0.94 to 0.98 - Filters
18. A new reversed field on the base Filter class broke our serialization unit tests,
which expect transient fields to be marked transient even if they aren't used
in Java serialization
See HBASE-12996
0.94 to 0.98 - Filters
Why aren’t you transient??
19. - In 0.94, HTables were heavy, so we cached them
- We maintained a singleton cache of HTables so we wouldn't have to re-instantiate them repeatedly
0.94 to 0.98 - HTable Caching
[Diagram: HBaseDAOs say "Give me an HTable!" to the TableInstanceCache (a lazily initialized Map<TableName, HTable>), then call setAutoFlush()]
20. But in 0.98 HTables are light and you’re expected to create/destroy
them as needed.
We still use that cache because we make heavy use of batching
writes via setAutoFlush(false)
HTableMultiplexer is the 0.98 way to do this, but has a few issues
that prevented us from moving towards it:
1. Backwards API compatibility with 0.94
2. Dropping requests (i.e. put() returns false) if the queue is full
3. No flush() sync point to ensure all buffered data is written
0.94 to 0.98 - HTable Caching
21. Problem
● Adding/removing racks causes the balancer to move far too many regions
Solution
● Patch the balancer to limit the number of regions moved per run
Problem
● Regions with only 1 store file that had not been written to for a while were not
candidates for major compaction, so they could end up with non-local blocks
Solution
● HBASE-11195 - Potentially improve block locality during major compaction for
old regions
Problem
● Balancer does not respect draining nodes
Solution
● HBASE-10528 - DefaultBalancer selects plans to move regions onto draining
nodes
Some Operational Issues and Solutions
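As an example of acting on the draining-node fix (HBASE-10528), a RegionServer can be marked as draining before maintenance using the helper script HBase ships; the hostname and script path below are placeholders:

```shell
# Mark a RegionServer as draining so no new regions are assigned to it.
# With HBASE-10528 applied, the balancer also stops picking it as a target.
hbase org.jruby.Main /usr/lib/hbase/bin/draining_servers.rb add rs-042.example.com

# After maintenance, remove it from the draining list
hbase org.jruby.Main /usr/lib/hbase/bin/draining_servers.rb remove rs-042.example.com
```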
22. Ian Friedman - ianfriedman@yahoo-inc.com
Rahul Gidwani - rahulgidwani@yahoo-inc.com
Questions?
Editor's Notes
Dave Latham had a choice way back in 2008: sharded MySQL or HBase 0.18, and boy are we happy he went with HBase. We have scaled quite a bit since those days, and we don't think that would have been possible had Dave gone with a sharded MySQL app.
We have a Flurry SDK which app developers install -> API boxes -> log processing pipeline (Kafka) -> HBase
Uses HBase:
All of our mapreduce jobs
developer portal (website)
mobile ads
We run with 5 zookeeper instances, 1 namenode, 1 SNN, 1 hmaster, 1 backup hmaster
We started off as a 3 node cluster
Grew to 800 nodes in a single datacenter
decided we needed redundancy, so we spun up a new datacenter
(oh the joys of replication - things we found at scale)
HBASE-8806 - Row locks are acquired repeatedly in HRegion.doMiniBatchMutation for duplicate rows.
HBASE-9208 - ReplicationLogCleaner slow at large scale
HBASE-10100 - HBase replication cluster can have varying peers under certain conditions (transient DNS errors caused UnknownHostExceptions on some hosts, which were only logged)
How consumers interact with the cluster:
Our large cluster holds raw fact data; we ingest into it and run jobs that create summarized data, which we store in our 800 node cluster.
Our 800 node cluster holds data denormalized from the large cluster's table, which we need for aggregation jobs, serving, and secondary indexes.
Our 60 node cluster is a low latency key-value store whose entire dataset fits in the block cache.
Why do we need to do migrations? Adding data centers, merging large tables into one, major version upgrades of Hadoop and HBase... all with zero downtime.
The biggest bottleneck in transferring data is the throughput of your pipe. Because CopyTable ships uncompressed data, good compression rates make it quite inefficient. But the operational complexity is minimal, provided you don't have to babysit your long-running job.
Here we again start replication to the new cluster. Take a snapshot of the table(s) we want to ship.
After the snapshot has been transferred to the new datacenter, we have a patched bulk load which allows us to take that exported snapshot and bulk load it into the tables we specified.
If your compression ratio is good this will save time; we also modified the ExportSnapshot job a bit so that failed jobs could be resumed.
Another worry was that if time_between_major_compactions < time_taken_to_ship_snapshot, you can end up storing 2x the data on disk, since the file cleaner skips HFiles that are referenced by a snapshot.
Another concern is you can resurrect deleted data:
Suppose you snapshot your data
Run a delete
Major compact on your destination side
Then you import your snapshot
This is something we were willing to live with. You can also run a modified VerifyReplication on your data after shipping: any rows that are missing on the source cluster but exist on the destination cluster can then be deleted.
So our problem was that we have a very large cluster with quite a bit of data. Even with a dedicated 10Gbps link, it takes nearly a month to ship the data to a new datacenter. Snapshotting those tables while writes continued to the original tables would have put us over the limit in terms of DFS capacity. We have 1 really large table which is 1.2PB un-replicated; shipping it in one piece was going to be impossible. We thought about running a job that exported to sequence files (compressed further), distcp'ing those over, then running the import job. This seemed viable, but the import would have involved a 1.2PB shuffle, which was not desirable. So we decided to split the table up into 4 smaller batches and ship those. The process went as such:
Start Replication
Take Partial Snapshot
Export Partial Snapshot
Bulk Load Partial Snapshot
Delete partial snapshot
Rinse and repeat
This actually gave us the best of both worlds.
We were able to ship compressed data over the wire, saving weeks off our data transfer time
We didn't have to babysit a job that, if it failed, would have to start over
We weren't sitting awake late at night worrying about whether we would run out of DFS capacity
Francis at Yahoo had used Thrift to replicate between different versions of HBase. We modified this patch a bit to suit our needs, as we could not afford any cluster downtime to make this happen.
We used WebHDFS, and because the HFile format did not change, this worked out for us.
We used shaded jars to run VerifyReplication jobs between clusters.
Our client code had to be compatible with both versions of hadoop / hbase while remaining encapsulated from our developers.
Another issue was how to work with YARN.
All of our jobs scan our hbase tables and write back to our hbase tables. In MapReduce V1 we had some patches to protect our cluster.
One was a configurable option on a per-job basis, max_concurrent_map_tasks, which limited the number of concurrent tasks running for any particular job. This patch mainly protected things like our database from having too much load sent to it by some jobs.
The other was a configurable parameter, max_tasks_input_split. When you use the MapReduce framework with HBase, your input splits are based on regions, which you can aggregate to figure out how many tasks are slated for a particular regionserver.
If some jobs are particularly IO intensive and hammer regionservers, you might want some sort of rate limiting, but until recently regionservers did not have that capability. So this lets you limit the number of concurrent tasks running against any regionserver per job.
So since we had to maintain multiple data centers running on different versions, that means we would have to maintain a build of our software that could work on Hadoop 1 and HBase 94, and simultaneously maintain a branch that would work on Hadoop 2 and HBase 98, potentially for months as we worked through the datacenter migrations. So just to see what the delta was, we replaced all of the dependency jars in our project and saw what broke at compile time. Thankfully, not too much changed in HBase between 94 and 98! For the most part, everything would resolve properly at runtime, but this didn’t work for everything. Quite a few method signatures changed, new types were introduced, abstract classes gained/lost members, etc. So we had to come up with a plan to handle maintaining compatibility on both sides for potentially a long time.
Our interfaces that touch the HBase client code are mostly encapsulated, securely behind DAOs and other utility classes. So we extracted all of the base classes that actually touch endpoints like HTable and such, and put them into their own project, and set up a separate build chain with Maven and Jenkins and Artifactory to be able to build this library independent from the rest of our code base. We then forked the project and set up one fork to depend on HBase 94, and the other on HBase 98. Then we introduced variables into our build process for the main flurry projects to be able to inject a switch that would either pull the current version of hbaseCommon 94 or 98. So then we could just have a checkbox on our Jenkins build and deploy jobs and everything is handled for us.
So now I’d like to give you a taste of some of the incompatibilities we encountered when updating our client code from 94 to 98. If you’re planning to make an upgrade like this any time soon, this may help you out. Obviously your mileage may vary and our usecases are not the same as everyone else’s, but you might run into some of these hurdles.
We heavily use server side filters in our application, and maintain a bunch of them. A few things changed there, some method signatures, but the big deal for me was figuring out the new serialization changes.
For those who don’t use Filters or don’t know how this works, filters have to be able to serialize themselves to be sent to the RegionServer along with a Scan. When I say serialize themselves, what that really means is that they capture any arguments or input values and serialize those, and send them in the request. In 94, this was handled via the Hadoop Writable interface, and implementing custom read/write formats. However in 98, because we now use protobufs to communicate between client and server, the Filters need to be wrapped in a protobuf enclosure even though they have to handle their serialization themselves.
As you can see in the slide, we added another abstract class in between the base HBase Filter class and our own code that takes care of some of the method signature and name differences.
Now furthermore, in order to instantiate a filter from a serialized buffer on the RegionServer side, HBase 98 requires that the filter class contain a static parseFrom factory method, where previously it just used the regular constructor plus the Hadoop Writable readFields method. Luckily this was a purely additive change: since no such method existed in the 94 world, we could just go ahead and add the static method to all of our Filter subclasses directly, with no worries about backwards compatibility with the 94 client code. Sadly, we had to put a slightly different copy into each class so it could call the appropriate constructor. And since the base Filter class's parseFrom throws no exceptions, we had to catch exceptions from our deserialization method and rethrow them as a RuntimeException.
This one really puzzled me... Our serialization methods all have unit tests around them that use reflection to populate a random instance of a serializable class, serialize it, deserialize it, and do a deep comparison of the result. But because the Filter base class gained this new reversed field, those tests broke. After dreading that we'd have to add reversed to the serialization of our filters, I did some reading and realized that reversed only ever gets set on the RegionServer itself, and never actually gets sent as part of a request over the wire. So clearly it should be transient, if it's not intended to be included in serialization? Well, I guess we're the only ones who use the Java keyword transient as a hint not just for Java serialization but for any other kind of serialization, since our tests ignore fields marked transient. We both implemented a way to exclude individually named fields from our tests and opened an HBase JIRA to add the transient keyword there, in case anyone else has similar use cases.
So another thing we looked at changing was our use of the HTable interface. In 94, the common wisdom was that creating a new HTable was an expensive operation, so we lazily cached them as necessary. We also wrapped the base HTable class with a semaphore to make it thread-safe, so that multiple threads could issue writes at the same time. This let us turn off autoflushing and buffer our writes from multiple threads, and see a big performance gain in write throughput.
Now in HBase 98, the expectations are different. Since connections to regionservers are pooled and cached behind the scenes, HTable creation and destruction are light operations, and callers are expected to create and destroy them as needed for each set of calls. However doing something like this would break the assumptions of our use cases, as I mentioned in the previous slide we’re making big use of setAutoFlush false to buffer and batch our writes.
It seems like the 98 way to do this is to use HTableMultiplexer, but I identified several problems with trying to introduce this to our codebase.
the calling API into HTM is very different from our use of HTable, so it would require a lot of modifications. By itself this would not be a dealbreaker, except…
Unlike HTable which blocks when its buffer is full and it is trying to flush to HBase, HTableMultiplexer just returns false if a put would overflow the buffer… I decided this would be too much of a difference for how the calling code expects its HBase client to behave. But on top of that...
There is no way to ensure all data is flushed like there is with the flush() method in HTable. The best way I could come up with would be to stop all writes and then call getHTableMultiplexerStatus, which seemed like an expensive operation to be doing on a regular basis.
So we stuck with the original plan and cached our HTables and so far that seems to work as expected.
Now I’ll throw it back over to Rahul
Without good monitoring you can't pinpoint your problems, and that becomes especially hard on 1000+ node clusters.
We store our metrics in OpenTSDB.
Your cluster is only as good as your poorest performing regionserver; finding it quickly and diagnosing what is wrong is the trick.
jstack is your friend; always profile and figure out where the bottlenecks are happening.
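One quick way to act on that advice: take a few spaced-out stack dumps of the suspect RegionServer and look for threads that stay blocked across dumps. The pid discovery and filenames here are illustrative:

```shell
# Find the RegionServer JVM and grab two stack dumps a few seconds apart;
# threads that are BLOCKED in both dumps point at the real bottleneck.
RS_PID=$(jps | awk '/HRegionServer/ {print $1}')
jstack "$RS_PID" > rs-stack-1.txt
sleep 10
jstack "$RS_PID" > rs-stack-2.txt
grep -c 'BLOCKED' rs-stack-1.txt rs-stack-2.txt
```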