DeathStar
Easy, Dynamic, Multi-tenant HBase via
YARN
Ishan Chhabra, Nitin Aggarwal
Rocketfuel Inc.
In a not so distant past…
1000 node cluster
Rogue Applications
Cannot customize per application
Hard to capacity plan or support new applications
Key Insight: HBase Multi-Tenancy and Access Patterns
Pattern 1: Online Operational Store (a single service backed by HBase)
Pattern 2: Mutable Materialized View (multiple streams joined in HBase, consumed by multiple data pipelines)
Pattern 3: Transient Cache (a prep stage loads HBase; later pipeline stages read from it)
The Common Solution:
Separate Clusters
Non-uniform network usage
Different DFSs, leading to a lot of copying of data
Low cluster utilization
High lead time for new applications
Run HBase on YARN
Built on top of Slider
Solution:
DeathStar
Provisioning Model: Hangar and App Clusters 1, 2, 3 on a shared HDFS + YARN base layer
(grid/deathstar): $ git commit
Capacity planning and configuration discussion
Create simple JSON config
As applications mature from hangar to their cluster
Dynamic Cluster:
Make API call to start, stop and scale cluster
Static Cluster:
Good to go
Clusters Today: HDFS (1000 machines) + YARN, running lsv-hangar (20), lsv-arp (100), lsv-factdata (80), lsv-rtb-aux (100), lsv-attribution (80), lsv-user-features (60), lsv-user-geo-features (10), lsv-helios-hbase (10)
Strict Isolation
Common HDFS Layer
Bulkload
MapReduce over snapshots
Fits into organization’s capacity planning model
Dynamic config and cluster size changes
Clusters out of thin air
Hot swap a new cluster (human error / corruption)
Easier HBase version upgrades and testing
Temporary scale up for backfill
“Dynamic” enables interesting use cases
Key Challenges and Solutions
Another failure mode
Taken care of by auto restarts
RM HA in the works
Early Days: Bugs
Slider did not acknowledge container allocations correctly
Fixed recently in 0.8: SLIDER-828
Early Days: Bugs
Zombie Regionservers
Not easily reproducible, still debugging
Long running apps a secondary use case
Logging, an unsolved problem
Store logs on local disks, considering ELK
Usability
YARN/Slider lack certain scheduling constraints
At most x instances per node for spread and availability
Custom patch in-house
Rolling restarts for config changes
Solved recently in 0.8: SLIDER-226
Data Locality
Metrics Reporting
Custom hadoop metrics OpenTSDB reporter
App name passed via config
Multi-Tenancy?
Shared table: a mutable materialized view fed by multiple streams and read by multiple data pipelines
Conclusion:
Is it for me?
Conclusion:
Is it worth it?
Thank you!
Questions?
Reach us at:
ishan@rocketfuel.com
naggarwal@rocketfuel.com

DeathStar: Easy, Dynamic, Multi-Tenant HBase via YARN

Editor's Notes

  • #2 Welcome to this talk on DeathStar, our in-house solution to easily and instantly provision HBase clusters via YARN.
  • #3 To understand why we are where we are today, let's go back a little in time. Not too far, just a year ago.
  • #4 A 1000-node HBase cluster co-located with our Hadoop analytics cluster powered lots of interesting and business-critical applications (“high rising skyscrapers”).
  • #5 But once in a while, a rogue application would destroy the cluster, giving us sleepless nights. It had become so bad that we ourselves were uncomfortable recommending usage or deploying our own applications.
  • #6 Additionally, a lot of important HBase properties (like block cache, memstore, etc.) are set at the regionserver (worker) level and cannot be customized per application in a shared cluster, e.g. some applications need more block cache, some don’t need memstore due to bulkloads, etc.
  • #7 It was hard to understand what cluster resources an application was consuming in terms of compute, memory and network, and hard to capacity plan for the growth of the cluster and its applications.
  • #8 Given this set of challenges, we took a step back and tried to understand how applications were using HBase.
  • #9 Use Case 1: People would build services that used HBase as their main storage engine. Only the service interacted with the cluster. No sharing.
  • #10 Use Case 2: HBase is very good at storing large amounts of data and providing point updates and reads. Multiple streams of incoming data are joined and aggregated, and made available to multiple data pipelines to consume further. Usually a single writer and multiple readers.
  • #11 Use Case 3: A large out-of-memory cache for a data pipeline (a series of MapReduce or Spark jobs). A prep stage loads the data into the cache, further stages use it, and the data is not needed after the pipeline finishes and can be cleaned up.
  • #12 Given these access patterns, we noticed that most tables do not need to be shared among multiple entities. In some companies, this problem is solved by having separate clusters.
  • #13 But we did not like that approach, due to these problems. High lead time reduces productivity and slows down engineering; we like to move fast. We buy a standard machine: some applications require high memory, others don’t, so resources in the individual clusters are often wasted. Data is written to the HBase cluster by MR jobs, leading to a lot of copying between the different clusters’ DFSs. Copying data between clusters usually leads to a non-uniform distribution of network traffic, and can bottleneck TOR switches.
  • #14 Hence, we decided to skip the separate-clusters solution and jump to running HBase in YARN containers. I hope I don’t have to clarify what YARN is to this audience. We built the solution on top of the Apache Slider project.
  • #15 We set up a simple provisioning model to make it easy for new applications to get started. The cluster runs a base layer of HDFS and YARN. We have a 30-odd-node HBase cluster called “Hangar” which is free to experiment on and has no SLAs. So if my girlfriend dumps me, I may decide to overcome my anger by wiping out the cluster, and nobody can complain. Fortunately that hasn’t happened yet. Coming back, once a developer has built a prototype, we work with them and provision a separate cluster for them.
  • #16 As applications mature from Hangar to their own cluster, we first sit down with the developer to understand how many HBase regionservers the application needs, and what configuration it needs in terms of block cache, memstore, RPC handler threads and other tunable HBase parameters. The developer then creates a simple config and commits it into our codebase (see the illustrative config sketch after these notes).
  • #17 Now there are two possibilities. If it is a static cluster (a cluster that is always running), then you are good to go; the system automatically brings up the cluster. If it is a dynamic cluster (a cluster that can be started and stopped at any time), then the application makes API calls to start, stop and scale the cluster.
  • #18 Using this model we are running 8 clusters in production today, and growing. The clusters vary in size and configuration, from 10 containers, all the way up to 100 containers.
  • #19 We found various benefits in moving to this model, some of which we had imagined, some that we hadn’t.
  • #20 We try to go with a model of one HBase cluster per application as much as possible. This gives us strict isolation between the various applications, and prevents the “godzilla” problem that we mentioned before.
  • #21 All of these clusters share a common HDFS layer, which is also used by our MR and Spark jobs. This avoids the problems of data copying and non-uniform network usage. A lot of our data pipelines load data into their HBase clusters via bulkloads, and run MapReduce over snapshots for fast access, which becomes very easy given the shared HDFS cluster (see the bulkload and snapshot-scan sketches after these notes).
  • #22 This provisioning model plays well with our organization’s capacity planning model. Each team and its subteams get a share of the YARN cluster, which is enforced via hierarchical queues in YARN. A developer uses resources assigned to his team to provision his cluster, and can negotiate for more with his team lead as needed.
  • #23 Being able to change the size of a cluster and its configuration instantly via a simple config is truly liberating for developers. For example, this is just another day in the life of a developer where he is increasing the number of nodes, and decreasing the block cache and memory allocation for his application because the application needs more RPCs and would not benefit from block cache. These changes are easy and frequent.
  • #24 And finally, one can quickly bring up clusters out of thin air, and dissolve them as needed.
  • #25 Being able to dynamically bring up and tear down clusters with different configurations enables many interesting use cases that we see in production. It becomes much easier to bring up a temporary cluster with a newer version of HBase to test applications, making the HBase version upgrade process easier and surprise-free. We have also, interestingly, seen cases where an application has written some bad or incorrect data and wants to backfill and rewrite it. The cluster is provisioned to sustain the read/write throughput of standard runs of the application, and it would take a lot of time to backfill data. In these cases, we simply scale up the cluster temporarily to allow for faster backfills, and then scale it down to support the usual throughput, making the process a lot easier for the developer. Finally, in cases where there is massive corruption of data due to human error, one can simply create a new cluster, bring it up to speed, and hot swap it with the existing cluster for an application.
  • #26 We created this solution due to our urgent needs, and started a year ago at a time when YARN and Slider were still very young. They are still very young. Let's discuss some of the key challenges, some of which are solved and some of which aren’t. This is to give you a fair idea so that you can make an informed decision if you take this route.
  • #27 By adding another system, YARN, between HBase and bare metal, we have added another set of failure modes. We have seen RM failures and restarts more often than we would like in production. This is taken care of today by our monitoring engine, which automatically restarts clusters when they fail, so that we don’t have to do it manually. RM HA is in the works on the production YARN cluster to reduce failures.
  • #28 Slider is still a young project, and there can be critical bugs that affect the entire YARN cluster and HBase applications. For example, Slider had a bug where it would hoard containers and not acknowledge allocations correctly to YARN. This led to undesirable preemption in our YARN cluster for other users. The problem was reported and fixed upstream, and the solution is part of Slider 0.8, released recently.
  • #29 We also infrequently see a weird issue of zombie regionservers running on machines when they should have shut down properly. It is not easily reproducible; we are still debugging and will push a fix upstream when we find it.
  • #30 A more fundamental problem is that long-running applications have traditionally been a secondary use case for YARN, given that MR was the first and most widely used application on YARN. This is visible sometimes. E.g., until recently, YARN would aggregate the logs and store them on HDFS only when the application finished, which makes sense for MR, but not for long-running applications. Long-running applications create huge logs on local disks, the aggregation process would invariably crash due to the huge amounts of data being copied, and the logs would not be available for inspection in real time. This was improved in YARN 2.6 to rotate and move logs every 12 hours, but it is still not enough. Logs are very important for us to understand the behavior of HBase clusters, so we decided to store them on a local disk on the node running the regionserver, in a unique directory, and today we log in directly to the machine to check logs when needed. We are now considering ELK to aggregate and search the logs in real time more easily.
  • #31 In general, usability is lacking given that Slider is such a young project, and we have ended up building various pieces around it. E.g., it is hard to find the location of the HBase master page, given that the master is provisioned on a random node with a random port, and locating it requires a bunch of steps. We built a simple UI to locate it programmatically and provide a simple “jump board”. Similarly, Slider requires one to specify all the configuration for every HBase cluster. We built a hierarchical config system where we maintain a set of sane base configurations for HBase, and every application-specific config essentially inherits from it and overrides parts as needed (see the layered-config sketch after these notes). And so on.
  • #32 Another big area of improvement needed for Slider and YARN is adding various kinds of scheduling constraints. For example, we needed the notion of scheduling at most x (usually 2-3) instances of HBase regionservers on a single physical machine, to increase availability of the cluster in case of machine failure. When we started, Slider would usually place almost all of the required containers on 2-3 nodes! We have a custom patch in-house to support this, which we are working on contributing upstream.
  • #33 Slider/YARN did not have the ability to rolling-restart an HBase cluster, which we find very useful for pushing config changes to a live cluster. This has been added recently to Slider and YARN.
  • #34 Another important point to consider is the locality of data on HDFS and the scheduling of containers. There is no support for this today, and the containers are scheduled randomly. However, if the containers are stable and not killed, you end up with locality over time, due to compactions and new writes going to the local disk first. We have been thinking about making changes to Slider and HBase to get maximum locality: Slider would try to schedule containers onto the machines that have the data for the cluster, and a custom region balancer would try to maximize data locality for the regions. We haven’t felt a strong need for it yet.
  • #35 Finally, HBase metrics are very important for us to fix problems and understand running clusters in general. We use OpenTSDB in-house for our metrics storage and visualization needs. To make this new model play well with OpenTSDB, we created a custom Hadoop metrics reporter to send metrics to OpenTSDB (see the metrics-sink sketch after these notes). We pass in the application name via config, which makes it easy to differentiate metrics for different HBase clusters in the UI.
  • #36 Now that we have made multi-tenancy easy for use cases where different applications use different HBase clusters, what about the case where they want to share an HBase table?
  • #37 For example, what about the application pattern we talked about earlier, of creating a mutable view by combining and aggregating multiple streams of data, which is then accessed by multiple data pipelines? We currently don’t face many problems for this use case, since the data pipelines usually use MapReduce over snapshots, which does not impact the HBase regionservers. We have started to see some early problems for a few use cases, though, and we plan to solve them by using features like per-user RPC queues in HBase, and adding our own in a similar vein as needed.
  • #38 So finally, having seen all the challenges behind the scenes, do I think this is the right solution for people in the audience? If you are an intermediate to advanced user of HBase with multiple applications, then I would definitely say yes. The stack is maturing quickly, and the value provided is immense.
  • #39 And is it worth it? Hell yeah! If you want to get a good night's sleep.
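
Illustrative sketch for note #16: the "simple JSON config" a developer commits is, in spirit, a Slider-style resources descriptor that names the HBase components and their sizes. The component names, property keys and numbers below are illustrative assumptions, not Rocketfuel's actual file, and the exact schema depends on the Slider version in use.

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {},
  "components": {
    "HBASE_MASTER": {
      "yarn.component.instances": "1",
      "yarn.memory": "2048",
      "yarn.vcores": "1"
    },
    "HBASE_REGIONSERVER": {
      "yarn.component.instances": "40",
      "yarn.memory": "8192",
      "yarn.vcores": "2"
    }
  }
}

Scaling a cluster then amounts to changing "yarn.component.instances" for HBASE_REGIONSERVER and re-applying the config (or, for a dynamic cluster, making the equivalent start/stop/scale API call).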
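
Sketch for note #21 (bulkload): a minimal MapReduce job that writes HFiles and bulkloads them into a table, so data lands on the shared HDFS layer without going through the regionservers' write path. The table name, column family, input format and paths are hypothetical, and the exact bulkload API differs slightly between HBase versions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // Turns one "rowkey<TAB>value" input line into a Put (assumes well-formed input).
  static class ToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t", 2);
      byte[] row = Bytes.toBytes(parts[0]);
      Put put = new Put(row);
      put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path input = new Path("/data/user_features/input");       // hypothetical input
    Path hfiles = new Path("/tmp/user_features_hfiles");      // hypothetical staging dir
    TableName tableName = TableName.valueOf("user_features"); // hypothetical table

    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin();
         Table table = conn.getTable(tableName);
         RegionLocator locator = conn.getRegionLocator(tableName)) {

      Job job = Job.getInstance(conf, "user-features-bulkload");
      job.setJarByClass(BulkLoadJob.class);
      job.setMapperClass(ToPutMapper.class);
      job.setMapOutputKeyClass(ImmutableBytesWritable.class);
      job.setMapOutputValueClass(Put.class);
      job.setInputFormatClass(TextInputFormat.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, hfiles);

      // Wires in the reducer, total-order partitioner and HFileOutputFormat2 so
      // that the generated HFiles line up with the table's current regions.
      HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

      if (!job.waitForCompletion(true)) {
        System.exit(1);
      }

      // Moves the finished HFiles into the regions; the regionservers' write path
      // (WAL, memstore) is never touched.
      new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, admin, table, locator);
    }
  }
}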
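
Sketch for note #21 (MapReduce over snapshots): a job that scans a table snapshot directly from the shared HDFS layer via TableSnapshotInputFormat, so heavy analytical reads never touch the live regionservers. The snapshot name, restore directory and output path are hypothetical.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer;

public class SnapshotScanJob {

  // Emits ("rows", 1) per row; a real pipeline would do its aggregation here.
  static class RowCountMapper extends TableMapper<Text, LongWritable> {
    private static final Text ROWS = new Text("rows");
    private static final LongWritable ONE = new LongWritable(1L);

    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context)
        throws IOException, InterruptedException {
      context.write(ROWS, ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan-over-snapshot");
    job.setJarByClass(SnapshotScanJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false); // block cache is irrelevant for a one-pass MR scan

    // Reads the snapshot's HFiles directly from HDFS, so the live HBase cluster
    // sees no read load. The snapshot name and restore directory are made up.
    TableMapReduceUtil.initTableSnapshotMapperJob(
        "user_features_snapshot", scan, RowCountMapper.class,
        Text.class, LongWritable.class, job,
        true, new Path("/tmp/snapshot-restore"));

    job.setReducerClass(LongSumReducer.class); // sums the per-row 1s into a row count
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/snapshot-scan-output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}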
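
Sketch for note #31 (hierarchical configs): one simple way to get the "sane base plus per-application overrides" behavior is to layer Hadoop Configuration resources, since properties added later override earlier ones. The file paths and the property shown are hypothetical; this is only a sketch of the idea, not Rocketfuel's actual config system.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class LayeredHBaseConfig {

  public static Configuration forApp(String appConfigDir) {
    Configuration conf = HBaseConfiguration.create();           // HBase + Hadoop defaults
    conf.addResource(new Path("conf/base/hbase-site.xml"));     // shared, sane base settings
    conf.addResource(new Path(appConfigDir, "hbase-site.xml")); // app-specific overrides win
    return conf;
  }

  public static void main(String[] args) {
    Configuration conf = forApp("conf/lsv-user-features");
    // Because later resources override earlier ones, the app file only needs to
    // list what differs from the base, e.g. block cache or handler counts.
    System.out.println(conf.get("hfile.block.cache.size"));
  }
}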
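
Sketch for note #35: the general shape of a custom Hadoop metrics2 sink that forwards HBase metrics to OpenTSDB over its line-based "put" protocol and tags each datapoint with the application name read from the sink configuration. This is an illustration of the approach, not Rocketfuel's reporter; the property names, metric naming and error handling are assumptions, and a production sink would keep a persistent connection and buffer writes.

import java.io.IOException;
import java.io.PrintWriter;
import java.net.Socket;

import org.apache.commons.configuration.SubsetConfiguration;
import org.apache.hadoop.metrics2.AbstractMetric;
import org.apache.hadoop.metrics2.MetricsRecord;
import org.apache.hadoop.metrics2.MetricsSink;
import org.apache.hadoop.metrics2.MetricsTag;

public class OpenTsdbSink implements MetricsSink {
  private String tsdHost;
  private int tsdPort;
  private String appName; // e.g. "lsv-user-features", passed in via config

  @Override
  public void init(SubsetConfiguration conf) {
    // Hypothetical properties set in hadoop-metrics2-hbase.properties, e.g.
    //   hbase.sink.opentsdb.class=com.example.metrics.OpenTsdbSink
    //   hbase.sink.opentsdb.host=tsd.example.com
    //   hbase.sink.opentsdb.appname=lsv-user-features
    tsdHost = conf.getString("host", "localhost");
    tsdPort = conf.getInt("port", 4242);
    appName = conf.getString("appname", "unknown");
  }

  @Override
  public void putMetrics(MetricsRecord record) {
    long tsSeconds = record.timestamp() / 1000;
    StringBuilder tags = new StringBuilder(" app=").append(appName);
    for (MetricsTag tag : record.tags()) {
      if (tag.value() != null && !tag.value().isEmpty()) {
        tags.append(' ').append(tag.name()).append('=').append(tag.value());
      }
    }
    // One OpenTSDB "put" line per metric in the record; a real sink would also
    // sanitize tag values and reuse a long-lived connection.
    try (Socket socket = new Socket(tsdHost, tsdPort);
         PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
      for (AbstractMetric metric : record.metrics()) {
        String name = record.context() + "." + metric.name().replace(' ', '_');
        out.println("put " + name + " " + tsSeconds + " " + metric.value() + tags);
      }
    } catch (IOException e) {
      // A real sink would buffer and retry; dropping keeps the sketch short.
    }
  }

  @Override
  public void flush() {
    // No-op: this sketch writes synchronously in putMetrics().
  }
}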