Highly Available HBase
Micah Wylde
@mwylde
HBaseCon ‘17
What is Sift Science?
Sift Science protects online businesses
from fraud using real-time machine
learning.
We work with hundreds of customers
across a range of verticals, countries,
and fraud types.
What is Sift Science?
[Diagram: a customer's backend streams user events to Sift ("bob added credit card", "bob opened app", "bob loaded cart page"), and Sift returns a fraud score.]
HBase at Sift
We use HBase to store all user-level
data—hundreds of terabytes.
We make hundreds of thousands of
requests per second to our online
HBase clusters.
Producing a risk score for a user may
require dozens of HBase queries.
600TB ● 48K regions ● 250 servers
Why HBase
• Scalable to millions of requests per second and
petabytes of data
• Strictly consistent writes and reads
• Supports write-heavy workloads
• Highly available …in theory
When we’re down, our
customers can’t make money
We went down a lot last year… mostly due to HBase
Since then we’ve mostly eliminated HBase downtime
How?
Step 0: Prioritize reliability
(this means deferring new features)
Circuit Breaking
Symptom:
When a single region server became unavailable
or slow, our application would stop doing work.
Replicating the issue
with Chaos Engineering
• Killing processes
• Killing servers
• Partitioning the network
• Throttling network on HBase port
Replicating the issue
with Chaos Engineering
$ tc qdisc add dev eth0 handle ffff: ingress
$ tc filter add dev eth0 parent ffff: \
    protocol ip prio 50 u32 \
    match ip protocol 6 0xff \
    match ip dport 60020 0xffff \
    police rate 50kbit burst 10k drop flowid :1
Throttles inbound traffic on the region server RPC port (60020) to 50 kbit/s
(don't try this on your production cluster)
What’s going on?
Profiling showed that all threads are
stuck waiting on HBase.
Even though just one HBase server is
down, our request volume is so high
that all handler threads eventually hit
that server and get stuck.
[Chart: application thread states over time (runnable, blocked, waiting), with nearly all threads blocked or waiting]
Circuit Breaking
A pattern in distributed systems where
clients monitor the health of the servers
they communicate with.
If too many requests fail, the circuit
breaker trips and requests fail
immediately.
A small fraction of requests are let
through to gauge when the circuit
becomes healthy again.
[State diagram: Closed, Open, Half-Open. While Closed, requests go through normally; enough failures trip the breaker to Open, where requests fail fast. After a cooldown the breaker moves to Half-Open and lets a trial request through: success closes the circuit, failure trips it back to Open.]
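To make the state machine concrete, here is a minimal sketch of a single breaker in plain Java. The names and thresholds are hypothetical; in production we use Hystrix, as the implementation slides that follow show.

public class CircuitBreaker {
  enum State { CLOSED, OPEN, HALF_OPEN }

  private final int failureThreshold; // consecutive failures before tripping
  private final long openMillis;      // how long to fail fast before probing
  private int consecutiveFailures;
  private long openedAt;
  private State state = State.CLOSED;

  public CircuitBreaker(int failureThreshold, long openMillis) {
    this.failureThreshold = failureThreshold;
    this.openMillis = openMillis;
  }

  // Callers check this before each request; false means fail fast.
  public synchronized boolean allowRequest() {
    if (state == State.OPEN) {
      if (System.currentTimeMillis() - openedAt >= openMillis) {
        state = State.HALF_OPEN; // let a trial request through
        return true;
      }
      return false;
    }
    return true; // CLOSED, or a HALF_OPEN trial
  }

  public synchronized void onSuccess() {
    consecutiveFailures = 0;
    state = State.CLOSED; // a successful trial re-closes the circuit
  }

  public synchronized void onFailure() {
    consecutiveFailures++;
    if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
      state = State.OPEN; // trip the breaker
      openedAt = System.currentTimeMillis();
      consecutiveFailures = 0;
    }
  }
}

Each region server would get its own instance, keyed by host and port, which is exactly the per-server keying the Hystrix command setter below uses.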
How well does this work?
very effective when one region server is unhealthy
[Chart: behavior with the circuit breaker vs. a control group while one region server is unhealthy]
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {
  @Override
  public void prepare(boolean reload) throws IOException {
    delegate.prepare(reload);
    // Figure out which region server this call will hit...
    if (delegate instanceof MultiServerCallable) {
      server = ((MultiServerCallable) delegate).getServerName();
    } else {
      HRegionLocation location = delegate.getLocation();
      server = location.getServerName();
    }
    // ...and key the Hystrix command on that server, so each
    // region server gets its own circuit breaker.
    setter = HystrixCommand.Setter
        .withGroupKey(HystrixCommandGroupKey.Factory.asKey(REGIONSERVER_KEY))
        .andCommandKey(HystrixCommandKey.Factory.asKey(
            server.getHostAndPort()));
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller / DelegatingRetryingCallable
private static class HystrixRegionServerCallable<R> extends
    DelegatingRetryingCallable<R, RegionServerCallable<R>> {
  @Override
  public R call(final int timeout) throws Exception {
    if (setter != null) {
      try {
        // Run the underlying RPC inside a Hystrix command so that
        // failures count against this region server's breaker.
        return new HystrixCommand<R>(setter) {
          @Override
          public R run() throws Exception {
            return delegate.call(timeout);
          }
        }.execute();
      } catch (HystrixRuntimeException e) {
        log.debug("Failed", e);
        if (e.getFailureType() == HystrixRuntimeException.FailureType.SHORTCIRCUIT) {
          // Breaker is open: tell HBase's retry machinery not to keep
          // retrying against a server we already know is unhealthy.
          throw new DoNotRetryRegionException(e.getMessage());
        } else if (e.getCause() instanceof Exception) {
          throw (Exception) e.getCause();
        }
        throw e;
      }
    } else {
      // prepare() hasn't resolved a server yet; call through directly.
      return delegate.call(timeout);
    }
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCaller
public static class HystrixRpcCaller<T> extends RpcRetryingCaller<T> {
  @Override
  public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException, RuntimeException {
    return super.callWithRetries(wrap(callable), callTimeout);
  }

  @Override
  public T callWithoutRetries(RetryingCallable<T> callable, int callTimeout)
      throws IOException {
    return super.callWithoutRetries(wrap(callable), callTimeout);
  }

  // Wrap region server calls so they run under a circuit breaker;
  // everything else passes through untouched.
  private RetryingCallable<T> wrap(RetryingCallable<T> callable) {
    if (callable instanceof RegionServerCallable) {
      return new HystrixRegionServerCallable<>(
          (RegionServerCallable<T>) callable, maxConcurrentReqs, timeout);
    }
    return callable;
  }
}
Circuit Breaking in hbase-client
Subclass RpcRetryingCallerFactory
public class HystrixRpcCallerFactory extends RpcRetryingCallerFactory {
public HystrixRpcCallerFactory(Configuration conf) {
super(conf);
}
@Override
public <T> RpcRetryingCaller<T> newCaller() {
return new HystrixRpcCaller<>(conf);
}
}
// override the caller factory in HBase config
conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
HystrixRpcCallerFactory.class.getCanonicalName());
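Wiring it all together: a sketch of creating a client connection that uses the custom caller factory. The table name and row key here are hypothetical; the rest is the stock hbase-client API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
// Tell hbase-client to build our Hystrix-wrapped retrying callers.
conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
    HystrixRpcCallerFactory.class.getCanonicalName());
try (Connection connection = ConnectionFactory.createConnection(conf);
     Table table = connection.getTable(TableName.valueOf("users"))) {
  Result result = table.get(new Get(Bytes.toBytes("bob")));
}

With this in place, every read and write issued through the connection is guarded by a per-region-server breaker, with no changes to application code.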
Replication
Replication
Circuit breaking helps us avoid
downtime when a small number of
region servers are unhealthy.
Replication allows us to recover quickly
when the entire cluster is unhealthy.
This most often occurs due to HDFS
issues or HBase metadata issues.
[Diagram: the application holds a primary connection to cluster 1 and a fallback connection to cluster 2; the clusters replicate to each other, and zookeeper records that cluster 1 is primary.]
Replication
We keep active connections to all
clusters to enable fast switching. A
zookeeper-backed connection provider
is responsible for handing out
connections to the current cluster.
If we see a high error rate from a
cluster, we can quickly switch to the
other while we investigate and fix.
This also allows us to do a full cluster
restart without downtime, speeding up
our ability to roll out new configurations
and HBase code.
[Diagram: after a switch, zookeeper records that cluster 2 is primary; the application's primary connection now points at cluster 2, with cluster 1 as fallback.]
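A sketch of what the zookeeper-backed connection provider could look like. The deck doesn't name a ZK client or znode layout, so Apache Curator and the /sift/hbase/primary path are assumptions for illustration.

import java.io.Closeable;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.cache.NodeCache;
import org.apache.hadoop.hbase.client.Connection;

// Hands out the Connection for whichever cluster zookeeper says is primary.
public class ZkConnectionProvider implements Closeable {
  private final Map<String, Connection> connections; // cluster name -> live Connection
  private final AtomicReference<String> primary = new AtomicReference<>();
  private final NodeCache cache;

  public ZkConnectionProvider(CuratorFramework curator,
                              Map<String, Connection> connections) throws Exception {
    this.connections = connections;
    // Hypothetical znode whose contents name the current primary cluster;
    // this sketch assumes it already exists.
    this.cache = new NodeCache(curator, "/sift/hbase/primary");
    cache.getListenable().addListener(() -> primary.set(
        new String(cache.getCurrentData().getData(), StandardCharsets.UTF_8)));
    cache.start(true); // buildInitial=true: read the znode before returning
    primary.set(new String(cache.getCurrentData().getData(), StandardCharsets.UTF_8));
  }

  /** The connection to the current primary cluster. */
  public Connection primary() {
    return connections.get(primary.get());
  }

  /** All cluster connections, e.g. for shadow reads against fallbacks. */
  public Map<String, Connection> all() {
    return connections;
  }

  @Override
  public void close() throws IOException {
    cache.close();
  }
}

Switching clusters is then a single zookeeper write, which every application server observes through its watch.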
Replication
Failover between clusters takes less
than a second across our entire
application fleet.
Connection configuration is also stored
in zookeeper, so we can add and
remove clusters without code changes
or restarts.
[Chart: requests per region server during a cluster switch]
Replication
To verify inter-cluster consistency we
rely on MapReduce jobs and online
client-side verification.
We automatically send a small
percentage of non-mutating requests to
the non-active clusters using a custom
subclass of HTable, comparing the
responses to those from the primary
cluster.
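Our implementation subclasses HTable; as a simplified sketch of the same idea, a wrapper around the Table interface might sample reads like this. The sample rate, executor, and mismatch counter are illustrative.

import java.io.IOException;
import java.util.Arrays;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

// Serves reads from the primary, shadows a sampled fraction to the
// fallback cluster, and counts mismatches for alerting.
public class VerifyingReader {
  private final Table primary;
  private final Table fallback;
  private final double sampleRate;        // e.g. 0.01 = 1% of gets
  private final ExecutorService executor; // keeps shadow reads off the hot path
  private final AtomicLong mismatches = new AtomicLong();

  public VerifyingReader(Table primary, Table fallback,
                         double sampleRate, ExecutorService executor) {
    this.primary = primary;
    this.fallback = fallback;
    this.sampleRate = sampleRate;
    this.executor = executor;
  }

  public Result get(Get get) throws IOException {
    Result result = primary.get(get);
    if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
      executor.submit(() -> {
        try {
          Result shadow = fallback.get(get);
          // Compares only the latest value of the first cell; a real
          // implementation would compare all cells. Replication lags
          // slightly, so we alert on the mismatch rate, not single diffs.
          if (!Arrays.equals(result.value(), shadow.value())) {
            mismatches.incrementAndGet();
          }
        } catch (IOException e) {
          // A failed shadow read is a fallback-health signal, not an error.
        }
      });
    }
    return result;
  }

  public long mismatchCount() {
    return mismatches.get();
  }
}

Sampling keeps the extra load on the non-active clusters small while still giving continuous consistency coverage.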
Monitoring
Monitoring
We collect detailed metrics from HBase
region servers and masters using
scollector. Metrics are sent to
OpenTSDB and a separate HBase
cluster.
We also use scollector to run hbck and
parse the output into metrics.
Metrics are queried by Bosun for
alerting and Grafana for visualization.
[Diagram: scollector agents on the region servers and masters send metrics through TSD relays to write TSDs, which store them in a separate metrics HBase cluster; Bosun and Grafana query the data back through read TSDs.]
Monitoring
Total requests per region server (from
region server metrics) helps detect
poorly balanced regions.
Monitoring
99p latencies (from region server
metrics) can show region servers that
are unhealthy due to GC, imbalance, or
underlying hardware issues.
Monitoring
We closely track percent_files_local
(from region server metrics) because
performance and stability are affected
by poor locality.
Monitoring
Inconsistent tables (reported by hbck)
can reveal underlying HBase metadata
issues. Here a region server failed,
causing many tables to become
inconsistent. Most recovered, but one
did not until we took manual action.
Some consistency issues can be fixed
by restarting masters; others require
running hbck fix commands.
Next steps
• Cross-datacenter replication and failover
• Automating recovery procedures (killing failing
nodes, restarting masters, running hbck commands)
• Automating provisioning of capacity
Questions