Lessons from the Cloud
Bryan Beaudreault, @HubSpotDev
Slide fragments (takeaways recovered from the deck):
You’re doing it
Improve reads, limit impact
Over-provision, fail fast
CPU-heavy workloads: reduce memory footprint, or add more servers (excellent, but expensive)
Use data encoding to reduce disk; use Java 7 and G1 to reduce GCs
Memory-heavy workloads: m1.xlarge
Master HBase with us.
Tech Lead on Data Ops at HubSpot
Talking about running HBase with real-time APIs
Specifically, what we’ve learned from running in EC2 for 3 years
Quickly, what is HubSpot?
Inbound marketing company. Most marketers cobble together tools: Google Analytics, MailChimp, WordPress.
All-in-one marketing platform. Integrating those tools provides extra value through shared context.
Uses of HBase: sending emails, analytics data, customers’ leads & contacts, internal tools
5 clusters, 10-30 nodes each. 1 shared Hadoop cluster.
9 teams using HBase as their datastore. Each team owns its Hadoop jobs, Kafka topics, and APIs hitting HBase.
Through 3 years of HBase operations, most days I’m doing things like…
Reading logs, looking at data, changing configs, and creating tools.
Digging into HBase code.
Making sure everything runs smoothly and fast for our developers and customers.
And, as I’m sure you all can understand … I HATE
Being woken up at night. When we first started running HBase in the cloud…
We saw this pretty often.
Not always the same time, but: ruined dinner plans, all-day firefighting sessions.
Sleep matters to me, and many nights, instead of sleeping, I found myself awake…
A lot of inter-dependencies. Contacts is used by everything. If that HBase cluster goes down…
There’s nothing fun about sitting in your cave in the dead of night, feverishly scrambling to get your entire product back online.
Some weeks after a few nights, you can feel a bit exasperated…
No one to call. Should running HBase in EC2 be this hard?
It’s a distributed system with lots of moving parts, running across multiple data centers.
We’re trying to mix real-time APIs with constant hadoop jobs.
Maybe this is the name of the game.
Wrong. Bigger companies are using HBase for bigger applications, and I don’t see them complaining.
They aren’t running in the cloud. Cloud isn’t the problem, but it’s obvious running in the cloud adds a whole set of variables. Servers degrade, HBase becomes unresponsive. HBase is not currently equipped to deal with all of these issues.
We can’t rely on just what HBase provides for stability.
How should it really be?
This is really how it should be. Running in the cloud should be just like in any DC.
Sit back and watch it run.
Long road. We have gotten (mostly) there. No late night wake ups in months. Knock on wood.
HBase runs itself. Performance is great: thousands of API r/s sustained, hundreds of thousands of hadoop jobs per month.
How did we get there? There were a few challenges …
Honestly: depending on your use case, hard to cheaply get multi-9 uptime
A lot of respect for Pinterest: all writes to 2 clusters. Fail over as necessary.
But we couldn’t follow their model. What can we do?
Have to be proactive — augment HBase with your own automation. Limit issues, respond ASAP.
Dedicate at least one person to this until stable; give him/her support.
A single EC2 availability zone is multiple data centers. Network is good, but can fluctuate.
These fluctuations can cause a big problem, for reads and writes.
Writes go to the memstore, and are written locally when flushed.
Though data is written locally, regions move all the time. When this happens..
Network is working against you, rather than for you.
Network graph starts to become..
Impossible to follow.
Let’s say one node disappears. What happens?
1+ network hops per read. Scan crosses multiple HFiles? Even more network hops.
Not great, but doable when you’re in a couple local racks. What happens when you’re across multiple data centers?
Bottom line: your 99th percentile degrades, and the impact of one loss can be huge.
Region moves as a result of: RS dies; region splits; periodic balancer runs.
Each region move is more entropy, slower requests, slower recovery.
With this in mind, what can we do?
Maintain 100% locality always.
That is, make sure all region data is always written locally.
When you lose locality (RS dies), heal ASAP.
Always compact regions after moving them.
Maintaining locality will…
With short circuit reads that means straight to disk or memory.
Loss of a RegionServer will still require failover, but that RS no longer hosts data for other servers. So client requests to other RSs will be mostly unaffected.
Overall you’re in a much better place.
How can we achieve this?
The default balancer keeps RegionServer load even. It doesn’t compact regions post-move.
Disable HBase balancer, and write your own. Use HBaseAdmin API to move and compact.
Using cost functions, prioritize Locality. Compact on move. Rate limit. Open-sourced.
Graceful shutdown: Hook into balancer. Compact on move
Disable splits. Track region moves and locality. Mention: 0.96.x: Stochastic load balancer.
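The locality-first selection behind a custom balancer can be sketched in miniature. This is a pure-logic illustration under assumptions, not our actual tool: the function names are hypothetical, and the real balancer drives moves and compactions through the HBaseAdmin API rather than returning a plan.

```python
# Sketch of locality-first placement (illustrative names, not the real balancer).
# localities maps each region to {server: fraction of the region's data local there}.

def best_server_for_region(region_localities):
    """Place a region on the server holding the most of its data locally."""
    return max(region_localities, key=region_localities.get)

def plan_moves(assignments, localities, threshold=1.0):
    """Yield (region, from, to) moves for regions below the locality threshold.
    After each real move we would also major-compact the region so the new
    host rewrites its HFiles locally, restoring 100% locality."""
    for region, current in assignments.items():
        target = best_server_for_region(localities[region])
        if target != current and localities[region][current] < threshold:
            yield (region, current, target)
```

In practice, a rate limit on the yielded moves keeps compaction load from overwhelming the cluster.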
You’re almost never alone. A single instance is part of a much larger neighborhood.
Instances are virtualized on a physical host.
Depending on your instance type, one physical host is shared with any number of neighbors.
Those neighbors are all doing their own thing. CPU intensive calculations; saturating their disk or networks.
And, despite virtualization…
These neighbors can have significant impact on your instance.
Disk slowness (unexplained iowait), cpu slowness (steal %), general server degradation
HBase will continue running through most of these issues.
Client calls build up, APIs start alerting. Impacting customers again.
How can we avoid (or mitigate) this?
HBase is good at this when a process or server just dies out right, because the ZK node will go away. Most EC2 failures don’t work like that though.
We run with 10-30% more than needed. The moment a server gives you issues, kill it
Try moving regions off; if that is slow, just kill -9. Using HBase’s stop command may be too slow when the host is having issues, since it needs to flush the memstore, etc. Relying on WAL replay will be faster.
But maybe we can do even better…
Two http endpoints: JMX as JSON, RS-status as JSON
Region server status page can also print JSON.
We can write a simple script to parse these. Look for callQueueLen >= 10x RPC handlers.
Inspect the handlers from the RS-status output. Start logging.
We take thread and heap dumps every few seconds, and log things like cpu load, iowait, steal, network io.
This provides a lot of great data for debugging. Optionally add killing, by removing znode or kill -9 after some threshold.
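A minimal sketch of that watchdog, under assumptions: the JMX endpoint path, port, and bean layout below are illustrative, and the real script also captures thread/heap dumps and OS stats once the threshold trips.

```python
import json
import urllib.request

# Hypothetical JMX-as-JSON endpoint; the actual path and port may differ per setup.
JMX_URL = "http://{host}:60030/jmx"

def queue_is_backed_up(jmx_json, handler_count, factor=10):
    """True if callQueueLen is at least factor * the RPC handler count."""
    for bean in jmx_json.get("beans", []):
        if "callQueueLen" in bean:
            return bean["callQueueLen"] >= factor * handler_count
    return False

def check_regionserver(host, handler_count=30):
    """Poll one RegionServer and flag it for diagnostic logging if backed up."""
    with urllib.request.urlopen(JMX_URL.format(host=host)) as resp:
        jmx = json.load(resp)
    if queue_is_backed_up(jmx, handler_count):
        # Here the real tool starts logging thread/heap dumps, cpu load,
        # iowait, steal, and network io; optionally kill after a threshold.
        print(f"{host}: call queue backed up, starting diagnostics")
```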
It’s tempting to think of HBase as a catchall that can handle all your different use cases. It can do a lot, but it needs to be tuned accordingly.
Initially had 1 cluster, and it was a nightmare. Heavy writes of Analytics conflicting with heavy reads of Contacts.
Apps did not fit well together. Landscape constantly changing.
Any time we made a change to accommodate one team, it impacted every team using the cluster.
We tried to make it work for a while, and this actually caused us to write better, safer code.
Eventually it became too much work.
Broke them up. One of our best decisions.
Partition your clusters, separate your concerns. Do it by usage pattern, optimize each accordingly.
Systems like Puppet make keeping these clusters similar easy. Libraries like Fabric make customization easy as well.
Use LDAP to give each server a cluster name; sync all of our configs to S3 so clients can read them.
Easier to make decisions. Easier to operate. Easier to track down failures.
This partitioning goes for hadoop too.
Mentioned we run hundreds of thousands of Hadoop jobs per month, most run against HBase.
Keeping them in control is critical for our real-time APIs
1 region = 1 mapper. 30 regions per regionserver might mean 30 mappers all running at once.
Wrote our own InputFormat and RecordReader, which groups all regions for a RegionServer onto 1 mapper (configurable).
Reducers already have the notion of a Partitioner. Use HTable’s getRegionLocations to get the RS mappings, and do the same.
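The grouping step at the heart of that InputFormat is simple. A sketch of just that logic, with illustrative names (the real implementation builds Hadoop InputSplits and makes the group size configurable):

```python
from collections import defaultdict

def group_regions_by_server(region_locations):
    """Collapse one-split-per-region into one split per RegionServer,
    so a job launches one mapper per server instead of one per region."""
    splits = defaultdict(list)
    for region, server in region_locations:
        splits[server].append(region)
    return dict(splits)
```

With 30 regions per RegionServer, this turns 30 concurrent mappers per server into 1, which is what keeps Hadoop load off the real-time APIs.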
HubSpot APIs sustain multiple thousands of requests per second.
If a RegionServer is dying, or hadoop job is hammering, requests will hang in API.
With high concurrency, even with low timeouts, threads can pile up.
Starvation could bring down all API nodes, even though only a portion of data was really unavailable.
We know quickly: we monitor threads very closely, using codahale’s metrics library.
But you shouldn’t respond manually, and there are patterns for this…
That’s where Hystrix comes in. It’s a circuit breaker from Netflix.
Modified HBase client to provide a circuit per-regionserver.
So region server slows down, open circuit to fail those requests, allowing others to succeed.
Hystrix will trickle requests to that RegionServer. Will close circuit when all is OK.
Also provides a great dashboard to get a view of latencies and r/s per regionserver from the client’s perspective.
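The per-RegionServer circuit idea can be sketched as follows. This is a minimal illustration in the spirit of Hystrix, not its actual implementation: the thresholds, timing, and class name are assumptions.

```python
import time

class ServerCircuit:
    """Minimal per-RegionServer circuit breaker (illustrative, not Hystrix)."""

    def __init__(self, failure_threshold=5, reset_after=30.0):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now=None):
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # After a cooldown, let a trial request through (half-open state).
        return now - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None  # close the circuit again

    def record_failure(self, now=None):
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open: fail fast for this RegionServer

# One circuit per RegionServer, so a single slow server fails fast
# while requests to healthy servers keep succeeding.
circuits = {}
```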
Basically, there are no ideal instance types in EC2
Some have not enough CPU, some not enough memory, some not enough disk
Old-generation instances are underpowered in memory and CPU; new-generation instances have extremely small SSD disks.
HBase is meant to run on commodity hardware, but this hardware should be configured appropriately. Most instance types weren’t designed with HBase in mind.
When you use the wrong size…
It won’t look as adorable as this. You’re gonna have a bad time.
HBase needs a certain amount of memory. Without, you face OOME and inefficient writes.
It needs enough CPU to handle compactions.
And your disk performance is critical for your overall read performance.
But you can make it work…
You just need to choose wisely, and realize it’s not hard to change.
Is your data set very dense? Do you do a lot of writes, or just a lot of reads? Are you bulk loading to run over with hadoop? It all depends on your workload. Do some testing.
We had our own progression at HubSpot…
Started with m1.xlarge. Couldn’t handle the compactions.
Moved to c1.xlarge. Struggled with memory: frequent small flushes, no page cache, OOM killer.
Fixed: reduced regions, aggressive caching/batching.
Recently released i2.4xlarge. Game changer, but expensive. Disk space issue
Replaced c1.xlarges, with a reasonable increase in cost. Worth it for the stability. 25GB heap. 30-50% CPU. Low iowait.
Talk about use cases (m1.xlarge == append only, etc). …
Metrics and data make all of these decisions a little easier.
HBase’s greatest strength, and one of its biggest weaknesses
Hundreds of metrics. Using them is like exploring a vast, uncharted territory. Mostly undocumented.
Metrics for almost everything you could want. Per region, per table, etc
A bit much, and overwhelming. Still doesn’t give the full picture.
Hard to visualize, hard to know what to look for to detect problems
A few that we found especially useful…
I have mentioned some throughout. Biggest ones for us are callQueueLen and client metrics.
fs latencies help to see problems in HDFS or disk.
Keep your queue sizes down.
Of course monitor OS-level metrics like steal, load, and free memory.
We store all metrics in OpenTSDB. Great datastore.
We found it helpful to be able to explore these more freely…
A colleague wrote lead.js, named after Graphite, its first integration. Open-source.
What is it? Frontend for time series data from systems like Graphite and OpenTSDB
Similar to IPython Notebook. Use CoffeeScript to explore data.
Hover over graph to see values at a time. Hide and highlight series as needed. Explain example.
Available on github @ http://lead.github.io/
We haven’t been afraid to get scrappy and hack together the tools we need.
But we didn’t always have the answers, and at times learning HBase seemed insurmountable.
It really isn’t though, and doesn’t have to seem that way.
There has been a lot of development in the community since we started.
Now: Great docs, very active user list, and lots of external resources
On top of that, we at HubSpot are starting to open-source and talk about these things…
So I’d like to invite you to reach out to us.
Check out our blog, where we will be posting a lot more details in the coming days and weeks.
So we can all sit back, relax, and watch HBase run.