Highly Available HBase
Micah Wylde
@mwylde
HBaseCon ‘17
What is Sift Science?
Sift Science protects online businesses
from fraud using real-time machine
learning.
We work with hundreds of customers
across a range of verticals, countries,
and fraud types.
What is Sift Science?
[Diagram: a customer's backend streams user events to Sift ("bob added credit card", "bob opened app", "bob loaded cart page"), and Sift returns a fraud score.]
HBase at Sift
We use HBase to store all user-level
data—hundreds of terabytes.
We make hundreds of thousands of
requests per second to our online
HBase clusters.
Producing a risk score for a user may
require dozens of HBase queries.
600TB ● 48K regions ● 250 servers
Why HBase
• Scalable to millions of requests per second and
petabytes of data
• Strictly consistent writes and reads
• Supports write-heavy workloads
• Highly available …in theory
When we’re down, our
customers can’t make money
We went down a lot last year… mostly due to HBase
Since then we’ve mostly eliminated HBase downtime
How?
Step 0: Prioritize reliability
(this means deferring new features)
Circuit Breaking
Symptom:
When a single region server became unavailable
or slow, our application would stop doing work.
Replicating the issue
with Chaos Engineering
• Killing processes
• Killing servers
• Partitioning the network
• Throttling network on HBase port
Replicating the issue
with Chaos Engineering
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: \
    protocol ip prio 50 u32 \
    match ip protocol 6 0xff \
    match ip dport 60020 0xffff \
    police rate 50kbit burst 10k drop flowid :1
Throttles inbound traffic on the region server RPC port (60020) to 50 kbit/s
(don't try this on your production cluster)
What’s going on?
Profiling showed that all threads are
stuck waiting on HBase.
Even though just one HBase server is
down, our request volume is so high
that all handler threads eventually hit
that server and get stuck.
[Chart: application thread states over time (runnable, blocked, waiting), with nearly all threads blocked or waiting]
Circuit Breaking
A pattern in distributed systems where
clients monitor the health of the servers
they communicate with.
If too many requests fail, the circuit
breaker trips and requests fail
immediately.
A small fraction of requests are let
through to gauge when the circuit
becomes healthy again.
[State diagram: Closed, Open, Half-Open. While Closed, requests go through normally; enough failures trip the breaker to Open, where requests fail fast. After a cooldown the breaker moves to Half-Open and lets a trial request through: success closes the circuit, failure trips it back to Open.]
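To make the state machine concrete, here is a minimal sketch of a single breaker in plain Java. The names and thresholds are hypothetical; in production we use Hystrix, as the implementation slides that follow show.

public class CircuitBreaker {
  enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold; // consecutive failures before tripping
  private final long openMillis;      // how long to fail fast before probing
  private int consecutiveFailures;
  private long openedAt;
  private State state = State.CLOSED;

  public CircuitBreaker(int failureThreshold, long openMillis) {
    this.failureThreshold = failureThreshold;
    this.openMillis = openMillis;
  }

  // Callers check this before each request; false means fail fast.
  public synchronized boolean allowRequest() {
    if (state == State.OPEN) {
      if (System.currentTimeMillis() - openedAt >= openMillis) {
        state = State.HALF_OPEN; // let a trial request through
        return true;
      }
      return false;
    }
    return true; // CLOSED, or a HALF_OPEN trial
  }

  public synchronized void onSuccess() {
    consecutiveFailures = 0;
    state = State.CLOSED; // a successful trial re-closes the circuit
  }

  public synchronized void onFailure() {
    consecutiveFailures++;
    if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
      state = State.OPEN; // trip the breaker
      openedAt = System.currentTimeMillis();
      consecutiveFailures = 0;
    }
  }
}

Each region server would get its own instance, keyed by host and port, which is exactly the per-server keying the Hystrix command setter below uses.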
How well does this work?
very effective when one region server is unhealthy
[Chart: behavior with the circuit breaker vs. a control group while one region server is unhealthy]
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {
  @Override
  public void prepare(boolean reload) throws IOException {
    delegate.prepare(reload);
    // Figure out which region server this call will hit...
    if (delegate instanceof MultiServerCallable) {
      server = ((MultiServerCallable) delegate).getServerName();
    } else {
      HRegionLocation location = delegate.getLocation();
      server = location.getServerName();
    }
    // ...and key the Hystrix command on that server, so each
    // region server gets its own circuit breaker.
    setter = HystrixCommand.Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey(REGIONSERVER_KEY))
        .andCommandKey(HystrixCommandKey.Factory.asKey(
            server.getHostAndPort()));
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {
  @Override
  public R call(final int timeout) throws Exception {
    if (setter != null) {
      try {
        // Run the underlying RPC inside a Hystrix command so that
        // failures count against this region server's breaker.
        return new HystrixCommand<R>(setter) {
          @Override
          public R run() throws Exception {
            return delegate.call(timeout);
          }
        }.execute();
      } catch (HystrixRuntimeException e) {
        log.debug("Failed", e);
        if (e.getFailureType() == HystrixRuntimeException.FailureType.SHORTCIRCUIT) {
          // Breaker is open: tell HBase's retry machinery not to keep
          // retrying against a server we already know is unhealthy.
          throw new DoNotRetryRegionException(e.getMessage());
        } else if (e.getCause() instanceof Exception) {
          throw (Exception) e.getCause();
        }
        throw e;
      }
    } else {
      // prepare() hasn't resolved a server yet; call through directly.
      return delegate.call(timeout);
    }
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller
public static class HystrixRpcCaller<T> extends RpcRetryingCaller<T> {
  @Override
  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException, RuntimeException {
    return super.callWithRetries(wrap(callable), callTimeout);
  }

  @Override
  public T callWithoutRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException {
    return super.callWithoutRetries(wrap(callable), callTimeout);
  }

  // Wrap region server calls so they run under a circuit breaker;
  // everything else passes through untouched.
  private RetryingCallable<T> wrap(RetryingCallable<T> callable) {
    if (callable instanceof RegionServerCallable) {
      return new HystrixRegionServerCallable<>(
          (RegionServerCallable<T>) callable, maxConcurrentReqs, timeout);
    }
    return callable;
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCallerFactory
public class HystrixRpcCallerFactory extends RpcRetryingCallerFactory {
public HystrixRpcCallerFactory(Configuration conf) {
super(conf);
}
@Override
public <T> RpcRetryingCaller<T> newCaller() {
return new HystrixRpcCaller<>(conf);
}
}
// override the caller factory in HBase config
conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
HystrixRpcCallerFactory.class.getCanonicalName());
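Wiring it all together: a sketch of creating a client connection that uses the custom caller factory. The table name and row key here are hypothetical; the rest is the stock hbase-client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
// Tell hbase-client to build our Hystrix-wrapped retrying callers.
conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
    HystrixRpcCallerFactory.class.getCanonicalName());
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("users"))) {
  Result result = table.get(new Get(Bytes.toBytes("bob")));
}

With this in place, every read and write issued through the connection is guarded by a per-region-server breaker, with no changes to application code.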
Replication
Replication
Circuit breaking helps us avoid
downtime when a small number of
region servers are unhealthy.
Replication allows us to recover quickly
when the entire cluster is unhealthy.
This most often occurs due to HDFS
issues or HBase metadata issues.
[Diagram: the application holds a primary connection to cluster 1 and a fallback connection to cluster 2; the clusters replicate to each other, and zookeeper records that cluster 1 is primary.]
Replication
We keep active connections to all
clusters to enable fast switching. A
zookeeper-backed connection provider
is responsible for handing out
connections to the current cluster.
If we see a high error rate from a
cluster, we can quickly switch to the
other while we investigate and fix.
This also allows us to do a full cluster
restart without downtime, speeding up
our ability to roll out new configurations
and HBase code.
[Diagram: after a switch, zookeeper records that cluster 2 is primary; the application's primary connection now points at cluster 2, with cluster 1 as fallback.]
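A sketch of what the zookeeper-backed connection provider could look like. The deck doesn't name a ZK client or znode layout, so Apache Curator and the /sift/hbase/primary path are assumptions for illustration.

import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.hadoop.hbase.client.Connection;

// Hands out the Connection for whichever cluster zookeeper says is primary.
public class ZkConnectionProvider implements Closeable {
  private final Map<String, Connection> connections; // cluster name -> live Connection
  private final AtomicReference<String> primary = new AtomicReference<>();
  private final NodeCache cache;

  public ZkConnectionProvider(CuratorFramework curator,
                              Map<String, Connection> connections) throws Exception {
    this.connections = connections;
    // Hypothetical znode whose contents name the current primary cluster;
    // this sketch assumes it already exists.
    this.cache = new NodeCache(curator, "/sift/hbase/primary");
    cache.getListenable().addListener(() -> primary.set(
        new String(cache.getCurrentData().getData(), StandardCharsets.UTF_8)));
    cache.start(true); // buildInitial=true: read the znode before returning
    primary.set(new String(cache.getCurrentData().getData(), StandardCharsets.UTF_8));
  }

  /** The connection to the current primary cluster. */
  public Connection primary() {
    return connections.get(primary.get());
  }

  /** All cluster connections, e.g. for shadow reads against fallbacks. */
  public Map<String, Connection> all() {
    return connections;
  }

  @Override
  public void close() throws IOException {
    cache.close();
  }
}

Switching clusters is then a single zookeeper write, which every application server observes through its watch.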
Replication
Failover between clusters takes less
than a second across our entire
application fleet.
Connection configuration is also stored
in zookeeper, so we can add and
remove clusters without code changes
or restarts.
[Chart: requests per region server during a cluster switch]
Replication
To verify inter-cluster consistency we
rely on MapReduce jobs and online
client-side verification.
We automatically send a small
percentage of non-mutating requests to
the non-active clusters using a custom
subclass of HTable, comparing the
responses to those from the primary
cluster.
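Our implementation subclasses HTable; as a simplified sketch of the same idea, a wrapper around the Table interface might sample reads like this. The sample rate, executor, and mismatch counter are illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

// Serves reads from the primary, shadows a sampled fraction to the
// fallback cluster, and counts mismatches for alerting.
public class VerifyingReader {
  private final Table primary;
  private final Table fallback;
  private final double sampleRate;        // e.g. 0.01 = 1% of gets
  private final ExecutorService executor; // keeps shadow reads off the hot path
  private final AtomicLong mismatches = new AtomicLong();

  public VerifyingReader(Table primary, Table fallback,
                         double sampleRate, ExecutorService executor) {
    this.primary = primary;
    this.fallback = fallback;
    this.sampleRate = sampleRate;
    this.executor = executor;
  }

  public Result get(Get get) throws IOException {
    Result result = primary.get(get);
    if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
      executor.submit(() -> {
        try {
          Result shadow = fallback.get(get);
          // Compares only the latest value of the first cell; a real
          // implementation would compare all cells. Replication lags
          // slightly, so we alert on the mismatch rate, not single diffs.
          if (!Arrays.equals(result.value(), shadow.value())) {
            mismatches.incrementAndGet();
          }
        } catch (IOException e) {
          // A failed shadow read is a fallback-health signal, not an error.
        }
      });
    }
    return result;
  }

  public long mismatchCount() {
    return mismatches.get();
  }
}

Sampling keeps the extra load on the non-active clusters small while still giving continuous consistency coverage.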
Monitoring
Monitoring
We collect detailed metrics from HBase
region servers and masters using
scollector. Metrics are sent to
OpenTSDB and a separate HBase
cluster.
We also use scollector to run hbck and
parse the output into metrics.
Metrics are queried by Bosun for
alerting and Grafana for visualization.
[Diagram: scollector agents on the region servers and masters send metrics through TSD relays to write TSDs, which store them in a separate metrics HBase cluster; Bosun and Grafana query the data back through read TSDs.]
Monitoring
Total requests per region server (from
region server metrics) helps detect
poorly balanced regions.
Monitoring
99p latencies (from region server
metrics) can show region servers that
are unhealthy due to GC, imbalance, or
underlying hardware issues.
Monitoring
We closely track percent_files_local
(from region server metrics) because
performance and stability are affected
by poor locality.
Monitoring
Inconsistent tables (reported by hbck)
can reveal underlying HBase metadata
issues. Here a region server failed,
causing many tables to become
inconsistent. Most recovered, but one
did not until we took manual action.
Some consistency issues can be fixed
by restarting masters; others require
running hbck fix commands.
Next steps
• Cross-datacenter replication and failover
• Automating recovery procedures (killing failing
nodes, restarting masters, running hbck commands)
• Automating provisioning of capacity
Questions