
HBaseCon 2017: Highly-Available HBase

In order to effectively predict and prevent online fraud in real time, Sift Science stores hundreds of terabytes of data in HBase—and needs it to be always available. This talk will cover how we used circuit-breaking, cluster failover, monitoring, and automated recovery procedures to improve our HBase uptime from 99.7% to 99.99% on top of unreliable cloud hardware and networks.


  1. 1. Highly Available HBase Micah Wylde @mwylde HBaseCon ‘17
  2. 2. What is Sift Science? Sift Science protects online businesses from fraud using real-time machine learning. We work with hundreds of customers across a range of verticals, countries, and fraud types.
  3. 3. What is Sift Science? [Diagram: user events from the customer's backend, web pages, and mobile app ("bob added credit card", "bob opened app", "bob loaded cart page") are sent to Sift, which returns a fraud score to the customer backend.]
  4. 4. HBase at Sift We use HBase to store all user-level data—hundreds of terabytes. We make hundreds of thousands of requests per second to our online HBase clusters. Producing a risk score for a user may require dozens of HBase queries. 600TB ● 48K regions ● 250 servers
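     As a rough illustration of that access pattern, the dozens of point lookups behind a single risk score can be issued as one batched multi-get. This is a sketch only; the "user_events" table and "f" column family are made-up names, not Sift's actual schema.
     import java.io.IOException;
     import java.util.ArrayList;
     import java.util.List;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.TableName;
     import org.apache.hadoop.hbase.client.Connection;
     import org.apache.hadoop.hbase.client.ConnectionFactory;
     import org.apache.hadoop.hbase.client.Get;
     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.client.Table;
     import org.apache.hadoop.hbase.util.Bytes;

     public class UserFeatureLookup {
       // Fetches all rows needed for one risk score in a single batched call.
       public static Result[] fetchUserRows(List<byte[]> rowKeys) throws IOException {
         Configuration conf = HBaseConfiguration.create();
         try (Connection connection = ConnectionFactory.createConnection(conf);
              Table table = connection.getTable(TableName.valueOf("user_events"))) {
           List<Get> gets = new ArrayList<>();
           for (byte[] rowKey : rowKeys) {
             gets.add(new Get(rowKey).addFamily(Bytes.toBytes("f")));
           }
           // The client groups the gets by region server, so this costs one
           // RPC per server touched rather than one per row.
           return table.get(gets);
         }
       }
     }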
  5. 5. Why HBase • Scalable to millions of requests per second and petabytes of data • Strictly consistent writes and reads • Supports write-heavy workloads • Highly available …in theory
  6. 6. When we’re down, our customers can’t make money
  7. 7. We went down a lot last year… mostly due to HBase
  8. 8. Since then we’ve mostly eliminated HBase downtime
  9. 9. How?
  10. 10. Step 0: Prioritize reliability (this means deferring new features)
  11. 11. Circuit Breaking
  12. 12. Symptom: When a single region server became unavailable or slow, our application would stop doing work.
  13. 13. Replicating the issue with Chaos Engineering • Killing processes • Killing servers • Partitioning the network • Throttling network on HBase port
  14. 14. Replicating the issue with Chaos Engineering
     $ tc qdisc add dev eth0 handle ffff: ingress
     $ tc filter add dev eth0 parent ffff: protocol ip prio 50 u32 match ip protocol 6 0xff match ip dport 60020 0xffff police rate 50kbit burst 10k drop flowid :1
     Throttles the bandwidth available to HBase (region server port 60020) to 50 kbit/s (don't try this on your production cluster)
  15. 15. What’s going on? Profiling showed that all threads are stuck waiting on HBase. Even though just one HBase server is down, our request volume is so high that all handler threads eventually hit that server and get stuck. [Thread-state chart: runnable / blocked / waiting]
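     A minimal sketch (not from the talk) of the kind of check that surfaces this condition, using the JDK's ThreadMXBean; the org.apache.hadoop.hbase.client package filter is an assumption about where the stuck frames would appear.
     import java.lang.management.ManagementFactory;
     import java.lang.management.ThreadInfo;
     import java.lang.management.ThreadMXBean;

     public class StuckThreadReport {
       // Counts non-runnable threads whose stack includes an HBase client frame.
       public static long countThreadsBlockedOnHBase() {
         ThreadMXBean mx = ManagementFactory.getThreadMXBean();
         long stuck = 0;
         for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
           if (info == null || info.getThreadState() == Thread.State.RUNNABLE) {
             continue;
           }
           for (StackTraceElement frame : info.getStackTrace()) {
             if (frame.getClassName().startsWith("org.apache.hadoop.hbase.client")) {
               stuck++;
               break;
             }
           }
         }
         return stuck;
       }
     }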
  16. 16. Circuit Breaking A pattern in distributed systems where clients monitor the health of the servers they communicate with. If too many requests fail, the circuit breaker trips and requests fail immediately. A small fraction of requests are let through to gauge when the circuit becomes healthy again. [State diagram: Closed → Open when the breaker trips; Open → Half-Open after a cool-down, letting a probe request through; Half-Open → Closed if the probe succeeds, or back to Open if it fails.]
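     The following slides implement this with Hystrix, keyed per region server. The sketch below shows how such a breaker could be configured; the thresholds are illustrative, not Sift's actual values.
     import com.netflix.hystrix.HystrixCommand;
     import com.netflix.hystrix.HystrixCommandGroupKey;
     import com.netflix.hystrix.HystrixCommandKey;
     import com.netflix.hystrix.HystrixCommandProperties;

     public class BreakerSettings {
       // One command key per region server, so a single bad server trips only
       // its own breaker while requests to healthy servers keep flowing.
       public static HystrixCommand.Setter forServer(String hostAndPort) {
         return HystrixCommand.Setter
             .withGroupKey(HystrixCommandGroupKey.Factory.asKey("hbase-regionserver"))
             .andCommandKey(HystrixCommandKey.Factory.asKey(hostAndPort))
             .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                 // Require a minimum request volume before the breaker can trip.
                 .withCircuitBreakerRequestVolumeThreshold(20)
                 // Trip when more than half of recent requests fail.
                 .withCircuitBreakerErrorThresholdPercentage(50)
                 // Stay open for 5s, then let a probe request through (half-open).
                 .withCircuitBreakerSleepWindowInMilliseconds(5000)
                 // Run calls on the calling thread; cap concurrent requests per server.
                 .withExecutionIsolationStrategy(
                     HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
                 .withExecutionIsolationSemaphoreMaxConcurrentRequests(100));
       }
     }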
  17. 17. How well does this work? Very effective when one region server is unhealthy. [Chart comparing the circuit breaker against a control group]
  18. 18. Circuit Breaking in hbase-client: Subclass RpcRetryingCaller / DelegatingRetryingCallable
     private static class HystrixRegionServerCallable<R>
         extends DelegatingRetryingCallable<R, RegionServerCallable<R>> {
       // Fields elided on the original slide: the target server and the
       // Hystrix setter keyed by that server.
       private ServerName server;
       private HystrixCommand.Setter setter;

       @Override
       public void prepare(boolean reload) throws IOException {
         delegate.prepare(reload);
         if (delegate instanceof MultiServerCallable) {
           server = ((MultiServerCallable) delegate).getServerName();
         } else {
           HRegionLocation location = delegate.getLocation();
           server = location.getServerName();
         }
         setter = HystrixCommand.Setter
             .withGroupKey(HystrixCommandGroupKey.Factory.asKey(REGIONSERVER_KEY))
             .andCommandKey(HystrixCommandKey.Factory.asKey(server.getHostAndPort()));
       }
     }
  19. 19. Circuit Breaking in hbase-client: Subclass RpcRetryingCaller / DelegatingRetryingCallable
     private static class HystrixRegionServerCallable<R>
         extends DelegatingRetryingCallable<R, RegionServerCallable<R>> {
       @Override
       public R call(final int timeout) throws Exception {
         if (setter != null) {
           try {
             return new HystrixCommand<R>(setter) {
               @Override
               public R run() throws Exception {
                 return delegate.call(timeout);
               }
             }.execute();
           } catch (HystrixRuntimeException e) {
             log.debug("Failed", e);
             if (e.getFailureType() == HystrixRuntimeException.FailureType.SHORTCIRCUIT) {
               throw new DoNotRetryRegionException(e.getMessage());
             } else if (e.getCause() instanceof Exception) {
               throw (Exception) e.getCause();
             }
             throw e;
           }
         } else {
           // No breaker configured for this callable: fall through to the delegate.
           return delegate.call(timeout);
         }
       }
     }
  20. 20. Circuit Breaking in hbase-client: Subclass RpcRetryingCaller
     public static class HystrixRpcCaller<T> extends RpcRetryingCaller<T> {
       @Override
       public synchronized T callWithRetries(RetryingCallable<T> callable, int callTimeout)
           throws IOException, RuntimeException {
         return super.callWithRetries(wrap(callable), callTimeout);
       }

       @Override
       public T callWithoutRetries(RetryingCallable<T> callable, int callTimeout)
           throws IOException {
         return super.callWithoutRetries(wrap(callable), callTimeout);
       }

       private RetryingCallable<T> wrap(RetryingCallable<T> callable) {
         if (callable instanceof RegionServerCallable) {
           return new HystrixRegionServerCallable<>(
               (RegionServerCallable<T>) callable, maxConcurrentReqs, timeout);
         }
         return callable;
       }
     }
  21. 21. Circuit Breaking in hbase-client: Subclass RpcRetryingCallerFactory
     public class HystrixRpcCallerFactory extends RpcRetryingCallerFactory {
       public HystrixRpcCallerFactory(Configuration conf) {
         super(conf);
       }

       @Override
       public <T> RpcRetryingCaller<T> newCaller() {
         return new HystrixRpcCaller<>(conf);
       }
     }

     // override the caller factory in HBase config
     conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
         HystrixRpcCallerFactory.class.getCanonicalName());
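     A minimal usage sketch (not from the slides): once the custom factory is set on the client Configuration, every Table obtained from the resulting Connection retries its RPCs through the Hystrix-wrapped callers above. The class name here is illustrative.
     import java.io.IOException;

     import org.apache.hadoop.conf.Configuration;
     import org.apache.hadoop.hbase.HBaseConfiguration;
     import org.apache.hadoop.hbase.client.Connection;
     import org.apache.hadoop.hbase.client.ConnectionFactory;
     import org.apache.hadoop.hbase.client.RpcRetryingCallerFactory;

     public class CircuitBreakingConnection {
       // Builds an HBase Connection whose RPC retries go through HystrixRpcCallerFactory.
       public static Connection create() throws IOException {
         Configuration conf = HBaseConfiguration.create();
         conf.set(RpcRetryingCallerFactory.CUSTOM_CALLER_CONF_KEY,
             HystrixRpcCallerFactory.class.getCanonicalName());
         return ConnectionFactory.createConnection(conf);
       }
     }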
  22. 22. Replication
  23. 23. Replication Circuit breaking helps us avoid downtime when a small number of region servers are unhealthy. Replication allows us to recover quickly when the entire cluster is unhealthy. This most often occurs due to HDFS issues or HBase metadata issues. [Diagram: the application holds a primary connection to cluster 1 and a fallback connection to cluster 2; the two clusters replicate to each other, and ZooKeeper records that cluster 1 is primary.]
  24. 24. Replication We keep active connections to all clusters to enable fast switching. A zookeeper-backed connection provider is responsible for handing out connections to the current cluster. If we see a high error rate from a cluster, we can quickly switch to the other while we investigate and fix. This also allows us to do a full cluster restart without downtime, speeding up our ability to roll out new configurations and HBase code. [Diagram: the same setup after a switch, with ZooKeeper recording that cluster 2 is primary and cluster 1 acting as the fallback.]
  25. 25. Replication Failing over between clusters takes less than a second across our entire application fleet. Connection configuration is also stored in zookeeper, so we can add and remove clusters without code changes or restarts. [Chart: requests per region server during a switch]
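     A hedged sketch of what a ZooKeeper-backed connection provider could look like, using Apache Curator. The class name, znode layout, and use of NodeCache are assumptions for illustration, not Sift's implementation.
     import java.nio.charset.StandardCharsets;
     import java.util.Map;

     import org.apache.curator.framework.CuratorFramework;
     import org.apache.curator.framework.CuratorFrameworkFactory;
     import org.apache.curator.framework.recipes.cache.NodeCache;
     import org.apache.curator.retry.ExponentialBackoffRetry;
     import org.apache.hadoop.hbase.client.Connection;

     public class PrimaryClusterConnectionProvider {
       private final Map<String, Connection> connectionsByCluster;
       private final NodeCache primaryNode;

       public PrimaryClusterConnectionProvider(String zkQuorum,
                                               String primaryZnode,
                                               Map<String, Connection> connectionsByCluster)
           throws Exception {
         this.connectionsByCluster = connectionsByCluster;
         CuratorFramework client = CuratorFrameworkFactory.newClient(
             zkQuorum, new ExponentialBackoffRetry(1000, 3));
         client.start();
         // Cache the znode that names the primary cluster; updates arrive via watch,
         // so a failover only requires writing a new value to this node.
         this.primaryNode = new NodeCache(client, primaryZnode);
         this.primaryNode.start(true);
       }

       // Returns an already-open connection to whichever cluster is currently primary.
       public Connection primary() {
         String cluster = new String(
             primaryNode.getCurrentData().getData(), StandardCharsets.UTF_8);
         return connectionsByCluster.get(cluster);
       }
     }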
  26. 26. Replication To verify inter-cluster consistency we rely on MapReduce jobs and online client-side verification. We automatically send a small percentage of non-mutating requests to the non-active clusters using a custom subclass of HTable, comparing the responses to those from the primary cluster.
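     The talk describes a custom HTable subclass; the simplified sketch below captures the same shadow-read idea with a small helper that samples a fraction of gets, replays them on the standby cluster, and compares the responses. Class and method names are illustrative.
     import java.io.IOException;
     import java.util.concurrent.ThreadLocalRandom;

     import org.apache.hadoop.hbase.client.Get;
     import org.apache.hadoop.hbase.client.Result;
     import org.apache.hadoop.hbase.client.Table;
     import org.apache.hadoop.hbase.util.Bytes;

     public class ShadowReadVerifier {
       private final Table primary;
       private final Table standby;
       private final double sampleRate; // e.g. 0.01 verifies 1% of reads

       public ShadowReadVerifier(Table primary, Table standby, double sampleRate) {
         this.primary = primary;
         this.standby = standby;
         this.sampleRate = sampleRate;
       }

       // Serves the read from the primary cluster and, for a sampled fraction,
       // replays it on the standby cluster to check that the responses agree.
       public Result get(Get get) throws IOException {
         Result result = primary.get(get);
         if (ThreadLocalRandom.current().nextDouble() < sampleRate) {
           Result shadow = standby.get(get);
           // Only the latest value of the first column is compared here; a real
           // verifier would compare full cell sets and bump a mismatch metric.
           if (!Bytes.equals(result.value(), shadow.value())) {
             System.err.println("Shadow read mismatch for row "
                 + Bytes.toStringBinary(get.getRow()));
           }
         }
         // The caller always sees the primary cluster's result.
         return result;
       }
     }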
  27. 27. Monitoring
  28. 28. Monitoring We collect detailed metrics from HBase region servers and masters using scollector. Metrics are sent to OpenTSDB, which is backed by a separate HBase cluster. We also use scollector to run hbck and parse the output into metrics. Metrics are queried by Bosun for alerting and Grafana for visualization. [Diagram: scollector on region servers and masters sends metrics through TSD relays to write TSDs backed by a dedicated metrics HBase cluster; Bosun and Grafana query the read TSDs.]
  29. 29. Monitoring Total requests per region server (from region server metrics) helps detect poorly balanced regions.
  30. 30. Monitoring 99p latencies (from region server metrics) can show region servers that are unhealthy due to GC, imbalance, or underlying hardware issues.
  31. 31. Monitoring We closely track percent_files_local (from region server metrics) because performance and stability are affected by poor locality.
  32. 32. Monitoring Inconsistent tables (reported by hbck) can reveal underlying HBase metadata issues. Here a region server failed, causing many tables to become inconsistent. Most recovered, but one did not until manual action was taken. Some consistency issues can be fixed by restarting masters; others require running hbck fix commands.
  33. 33. Next steps • Cross-datacenter replication and failover • Automating recovery procedures (killing failing nodes, restarting masters, running hbck commands) • Automating provisioning of capacity
  34. 34. Questions
