
Supporting Apache HBase : Troubleshooting and Supportability Improvements


Published in: Technology

  1. Supporting Apache HBase: Troubleshooting and Supportability Improvements
  2. © Cloudera, Inc. All rights reserved. Who we are • Daisuke Kobayashi (d1ce_) • Customer support at Cloudera since 2012, focusing on HDFS and HBase • Apache HBase contributor • Toshihiro Suzuki (brfrn169) • Apache HBase committer since 2018 • Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera • Wrote and published a Japanese-language book on HBase for beginners
  3. Supporting HBase • Typical troubleshooting scenarios with HBase • Fix performance degradation (slowness) • Identify why a process crashed • Fix inconsistencies
  4. Agenda • General approach to HBase performance issues with existing tools • htop: a real-time monitoring tool for HBase
  5. General approach to HBase performance issues with existing tools (logs and metrics are strictly aligned to HBase 2.1 (CDH 6.2))
  6. Troubleshooting Performance Issues • Performance issues are tough! • Typical reasons • “Hot spot” region • Region with non-local data • Excessive I/O wait due to swapping or an overworked or failing hard disk • Stop-the-world long GC pauses in RegionServers • Slowness due to high processor usage • Network saturation, etc. • Source of truth • Logs (a lot!) • Metrics (a lot!)
  7. Approach to Performance Troubleshooting • Understanding the issue • Top-down • USE Method (specifically, focusing on U and S in this talk) • Source: https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools
  8. Resources and Observability in RegionServer • RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  9. Resources and Observability in RegionServer • RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  10. Resources and Observability in RegionServer • Components: RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client • Metrics to observe: Frequency of requests • RPC processed time, queue length & time • Memstore size • Flush size • Frequency of flush • Frequency of blocking updates • Cache size • Cache eviction ratio • Flush queue
  11. RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  12. RPC System Utilization & Saturation • Number of RPC requests • Incremented by one for each of the following actions at the RPC server level • doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "totalRequestCount" : 167130, • Master web UI: HBASE-21207 made the columns sortable!
  13. RPC System Utilization & Saturation • RPC queue length & request size • RegionServer web UI • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", "queueSize" : 619211, "numCallsInGeneralQueue" : 5, "numCallsInPriorityQueue" : 0, • queueSize: running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run • numCallsInGeneralQueue: queue for normal handlers; the number of handlers is controlled by hbase.regionserver.handler.count • numCallsInPriorityQueue: queue for high-priority handlers that deal with admin requests and system table operation requests; the number of handlers is controlled by hbase.regionserver.metahandler.count
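As a rough illustration of reading the queue metrics above, the sketch below parses a `/jmx`-style payload and pulls out the three IPC values. The bean name and values are the ones on the slide; the surrounding `{"beans": [...]}` envelope and the helper names (`find_bean`, `rpc_queue_summary`) are assumptions made for this example.

```python
import json

# Abbreviated payload; the metric values are the ones shown on the slide.
# The {"beans": [...]} wrapper is assumed here for illustration.
SAMPLE_JMX = """
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
   "queueSize": 619211,
   "numCallsInGeneralQueue": 5,
   "numCallsInPriorityQueue": 0}
]}
"""

IPC_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=IPC"


def find_bean(payload, name):
    """Return the first bean whose "name" equals `name`, or None."""
    for bean in payload.get("beans", []):
        if bean.get("name") == name:
            return bean
    return None


def rpc_queue_summary(payload):
    """Extract the RPC queue metrics from a parsed payload."""
    bean = find_bean(payload, IPC_BEAN)
    if bean is None:
        raise KeyError("IPC bean not found")
    return {
        # Bytes of all outstanding calls, executing or queued.
        "outstanding_call_bytes": bean["queueSize"],
        # Calls waiting for normal handlers.
        "general_queue_calls": bean["numCallsInGeneralQueue"],
        # Calls waiting for high-priority handlers.
        "priority_queue_calls": bean["numCallsInPriorityQueue"],
    }


summary = rpc_queue_summary(json.loads(SAMPLE_JMX))
print(summary)
```

A persistently non-zero general queue count alongside a growing outstanding-call size is the kind of saturation signal this slide is pointing at.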
  14. RPC System Utilization & Saturation • Number of processed/queued requests • If queued > processed, it is time to check a thread dump • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC", "ProcessCallTime_num_ops" : 10961, "QueueCallTime_num_ops" : 10961, • Cloudera Manager chart: select ipc_process_rate, ipc_queue_rate where roleType = REGIONSERVER
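One way to apply the "queued > processed" rule is to compare how much each counter grew between two samples of the IPC bean. This is only a sketch: the first sample uses the values from the slide, while the second sample and the helper names are made up for illustration.

```python
def op_deltas(prev, curr):
    """Return (processed, queued) call counts between two samples."""
    processed = curr["ProcessCallTime_num_ops"] - prev["ProcessCallTime_num_ops"]
    queued = curr["QueueCallTime_num_ops"] - prev["QueueCallTime_num_ops"]
    return processed, queued


def should_check_thread_dump(prev, curr):
    """True when more calls were queued than processed in the interval."""
    processed, queued = op_deltas(prev, curr)
    return queued > processed


# First sample: values from the slide.
earlier = {"ProcessCallTime_num_ops": 10961, "QueueCallTime_num_ops": 10961}
# Second sample: hypothetical values, chosen only to trigger the rule.
later = {"ProcessCallTime_num_ops": 11000, "QueueCallTime_num_ops": 11400}

print(op_deltas(earlier, later))
print(should_check_thread_dump(earlier, later))
```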
  15. RPC System Utilization & Saturation • Observability Improvements • In case of slowness on a scan.next() call, the target region name was unknown in the past. • HBASE-16972 improved the logging by adding ‘scandetails’.
     Without scandetails:
     2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
     With scandetails:
     2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
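To show how the added ‘scandetails’ field can be used, here is a small sketch that extracts the table and region from a responseTooSlow line. The sample line is an abbreviated copy of the one on the slide (several JSON fields dropped for brevity); the function name and the returned keys are illustrative choices, not part of HBase.

```python
import json
import re

# Abbreviated from the slide: only some of the JSON fields are kept.
LOG_LINE = (
    '2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: '
    '(responseTooSlow): {"method":"Scan","processingtimems":30000,'
    '"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer",'
    '"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)

MARKER = "(responseTooSlow): "
SCAN_DETAILS_RE = re.compile(r"table: (\S+) region: (\S+)")


def parse_slow_response(line):
    """Return key fields of a responseTooSlow line, or None if not one."""
    idx = line.find(MARKER)
    if idx < 0:
        return None
    # Everything after the marker is a JSON object.
    details = json.loads(line[idx + len(MARKER):])
    match = SCAN_DETAILS_RE.match(details.get("scandetails", ""))
    return {
        "method": details.get("method"),
        "processing_ms": details.get("processingtimems"),
        "queue_ms": details.get("queuetimems"),
        # Both stay None for lines logged without scandetails.
        "table": match.group(1) if match else None,
        "region": match.group(2) if match else None,
    }


slow = parse_slow_response(LOG_LINE)
print(slow["table"], slow["region"], slow["processing_ms"])
```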
  16. RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  17. Memstore Utilization & Saturation • Memstore size • RegionServer web UI • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "memStoreSize" : 5372418924, "name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions", "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903, "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0, "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
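The per-region keys in the Regions bean are long, so a small sketch of ranking regions by memstore size may help. The keys and values below are copied from the slide; the regular expression and function name are assumptions for this example, and the pattern would be ambiguous for a namespace whose name itself contains `_table_`.

```python
import re

# Key layout as seen on the slide:
# Namespace_<ns>_table_<table>_region_<encoded name>_metric_memStoreSize
REGION_METRIC_RE = re.compile(
    r"^Namespace_(?P<namespace>.+?)_table_(?P<table>.+)"
    r"_region_(?P<region>[0-9a-f]{32})_metric_memStoreSize$"
)

REGIONS_BEAN = {
    "name": "Hadoop:service=HBase,name=RegionServer,sub=Regions",
    "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize": 18295903,
    "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize": 0,
    "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize": 4642160,
}


def memstore_by_region(bean):
    """Return (table, encoded region, bytes) tuples, largest first."""
    rows = []
    for key, value in bean.items():
        match = REGION_METRIC_RE.match(key)
        if match:
            rows.append((match.group("table"), match.group("region"), value))
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranked = memstore_by_region(REGIONS_BEAN)
for table, region, size in ranked:
    print(table, region, size)
```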
  18. Memstore Utilization & Saturation • Compare the total memstore size across RegionServers • Cloudera Manager chart: select total_memstore_size_across_hregions where roleType = REGIONSERVER • Compare memstore size across regions within a RegionServer • Cloudera Manager chart: select memstore_size where category = HREGION
  19. Memstore Utilization & Saturation • Log snippet where a flush finishes
     2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
     • dataSize: cell’s data alone (key bytes and value bytes) that is going to be flushed. This can be allocated off-heap too. • heapSize: cell’s data on-heap along with its metadata and index (overhead of Java objects) • currentSize: cell’s data alone on-heap after the flush • 3db6134cedc326474801068c3cb4f2a9: encoded region name • 1625ms: how long did the flush take to complete?
     • Frequency of flush (per hour)
     # grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
      81 2019-05-13 17
       6 2019-05-13 18
     113 2019-05-15 02
      18 2019-05-15 04
      27 2019-05-15 12
     133 2019-05-15 19
       5 2019-05-15 20
     198 2019-05-15 22
      91 2019-05-15 23
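The grep pipeline above can also be sketched in Python, which makes it easy to keep the flush duration and region alongside the hourly count. The flush line is the one from the slide; the regular expression is an assumption based on that single sample, and the function names are illustrative.

```python
import re
from collections import Counter

# Pattern derived from the one sample line on the slide; other HBase
# versions may word the message differently.
FLUSH_RE = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} INFO .*"
    r"Finished flush of dataSize ~[\d.]+ \w+/(?P<data_bytes>\d+),"
    r".* for (?P<region>[0-9a-f]{32}) in (?P<millis>\d+)ms"
)

SAMPLE_LINES = [
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true",
    # A non-flush line, to show that it is skipped.
    "2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: "
    "Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M",
]


def parse_flush(line):
    """Return hour, region, flushed bytes and duration, or None."""
    match = FLUSH_RE.match(line)
    if match is None:
        return None
    return {
        "hour": match.group("hour"),
        "region": match.group("region"),
        "data_bytes": int(match.group("data_bytes")),
        "millis": int(match.group("millis")),
    }


def flushes_per_hour(lines):
    """Count finished flushes per 'YYYY-MM-DD HH' bucket."""
    counts = Counter()
    for line in lines:
        flush = parse_flush(line)
        if flush is not None:
            counts[flush["hour"]] += 1
    return counts


first_flush = parse_flush(SAMPLE_LINES[0])
hourly = flushes_per_hour(SAMPLE_LINES)
print(first_flush)
print(dict(hourly))
```

Unlike `uniq -c`, the counter does not depend on the log being in time order.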
  20. Memstore Utilization & Saturation • Indication of blocked updates due to high memstore utilization • Global memstore > hbase.regionserver.global.memstore.size
     Why were updates blocked?
     2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
     How long was it blocked?
     2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
     Blocking updates finished
     2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
     • A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
     19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
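The per-region blocking condition above is simple arithmetic, sketched below together with a parser for the "block ...ms" duration. The configuration values are illustrative only: 32 MiB with a multiplier of 4 is just one combination that yields the 128.0M limit seen in the exception, and the slides do not state the cluster's actual settings.

```python
import re

MIB = 1024 * 1024

BLOCKED_RE = re.compile(r"above high water mark and block (\d+)ms")


def region_blocking_limit(flush_size_bytes, block_multiplier):
    """hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size"""
    return block_multiplier * flush_size_bytes


def region_updates_blocked(memstore_bytes, flush_size_bytes, block_multiplier):
    """True when a region's memstore exceeds its blocking limit."""
    return memstore_bytes > region_blocking_limit(flush_size_bytes, block_multiplier)


def blocked_millis(line):
    """Return how long updates were blocked, or None for other lines."""
    match = BLOCKED_RE.search(line)
    return int(match.group(1)) if match else None


# Hypothetical settings, not taken from the slides.
limit = region_blocking_limit(flush_size_bytes=32 * MIB, block_multiplier=4)
print(limit // MIB)

# Log line copied from the slide.
WARN_LINE = (
    "2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: "
    "Memstore is above high water mark and block 2808ms"
)
print(blocked_millis(WARN_LINE))
```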
  21. RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  22. Blockcache Utilization & Saturation • RegionServer web UI • Current block cache usage • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "blockCacheSize" : 406847872, "blockCacheFreeSize" : 6291459, • Cache eviction • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "blockCacheEvictionCount" : 38257,
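A quick sketch of turning these raw values into a utilization figure and an eviction rate. It assumes that `blockCacheSize` plus `blockCacheFreeSize` equals the cache capacity; the size and eviction values come from the slide, while the later eviction sample, the interval, and the function names are hypothetical.

```python
def block_cache_utilization(size_bytes, free_bytes):
    """Fraction of the cache in use, assuming capacity = size + free."""
    capacity = size_bytes + free_bytes
    if capacity == 0:
        return 0.0
    return size_bytes / capacity


def evictions_per_second(prev_count, curr_count, interval_seconds):
    """Average eviction rate between two samples of the eviction counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_count - prev_count) / interval_seconds


# Values from the slide.
utilization = block_cache_utilization(406847872, 6291459)
print(round(utilization, 3))

# 38257 is from the slide; 38857 and the 60 s interval are made up.
rate = evictions_per_second(38257, 38857, 60)
print(rate)
```

A cache that is nearly full is normal for a warm RegionServer; it is the combination with a high eviction rate that suggests saturation.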
  23. Blockcache Utilization & Saturation • Compare the free size across RegionServers • Cloudera Manager chart: select block_cache_free_size where roleType = REGIONSERVER • Compare the evicted blocks ratio across RegionServers • Cloudera Manager chart: select block_cache_evicted_rate where roleType = REGIONSERVER
  24. RPC System (Handlers / Queues) • MemStore • BlockCache • HDFS Client
  25. HDFS Client Utilization & Saturation • Flush queue size • RegionServer web UI • Raw metrics: "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server", "flushQueueLength" : 0, • Cloudera Manager chart: select flush_queue_size where roleType = REGIONSERVER
  26. htop: Real-Time Monitoring Tool for HBase
  27. htop overview • HBASE-11062 htop • Work in progress! • Unix top-like tool • Real-time monitoring for HBase metrics
  28. htop motivation • HBase UIs • The metrics of the moment • Can't see the metrics in time series • Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana) • The metrics in time series • Collecting the latest metrics takes a little time • htop • Real-time monitoring • A lot of features for real-time monitoring
  29. htop motivation
                                HBase UI   Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics   htop
     Metrics of the Moment      ○          △                                                  ○
     Metrics in Time Series     ☓          ○                                                  ☓
     Real-Time Monitoring       △          △                                                  ○
  30. htop features: htop screen • Command to start htop: • $ hbase top • Similar to the Unix top command • The metrics are refreshed at a fixed interval, 3 seconds by default • Vertical and horizontal scrolling
  31. htop features: htop screen • Demo (https://asciinema.org/a/247434)
  32. htop features: Change refresh delay • Press the d key and enter a new refresh delay • We can also change the default refresh delay by specifying a command line argument: • e.g. $ hbase top -delay 2 # sets the default refresh delay to 2 seconds
  33. htop features: Change refresh delay • Demo (https://asciinema.org/a/247447)
  34. htop features: Metrics per Namespace/Table/RegionServer/Region • Press the m key and choose a mode • Namespace mode • metrics per Namespace • Table mode • metrics per Table • RegionServer mode • metrics per RegionServer • Region mode (default) • metrics per Region • We can also change the default mode by specifying a command line argument: • e.g. $ hbase top -mode n # sets the default mode to Namespace mode
  35. htop features: Metrics per Namespace/Table/RegionServer/Region • Demo (https://asciinema.org/a/247177)
  36. htop features: Choose displayed fields and change the order of fields • Press the f key and choose the displayed fields (by pressing the space key) • We can also change the order of the fields on the same screen • The Right key selects a field to move, then the Left key or Enter key commits
  37. htop features: Choose displayed fields and change the order of fields • Demo (https://asciinema.org/a/247306)
  38. htop features: Sort the metrics by the field values • Press the f key and choose a sort field (by pressing the s key) • Switch between descending and ascending order by pressing the R key • Demo (https://asciinema.org/a/247180)
  39. htop features: Filter with the field values • e.g. NAMESPACE==default, REQ/S>1000 • Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, ! • o key: Add a case-insensitive filter • O key: Add a case-sensitive filter • ctrl + o key: Show current filters • = key: Clear current filters
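To make the filter semantics above concrete, here is a small sketch of how such expressions could be evaluated against one row of metrics. This is not htop's implementation: the parsing rules, the placement of `!` as a leading negation, and all names are assumptions made for illustration.

```python
import operator
import re

# Assumed syntax: optional leading "!", a field name, an operator, a value.
FILTER_RE = re.compile(
    r"^(?P<neg>!?)(?P<field>[^=<>!]+?)(?P<op>==|>=|<=|=|>|<)(?P<value>.*)$"
)

NUMERIC_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(record, expr, ignore_case=True):
    """Evaluate one filter expression against a dict of field values."""
    parsed = FILTER_RE.match(expr.strip())
    if parsed is None:
        raise ValueError("cannot parse filter: " + expr)
    field = parsed.group("field").strip()
    op = parsed.group("op")
    wanted = parsed.group("value").strip()
    actual = record[field]
    if op in ("=", "=="):
        left, right = str(actual), wanted
        if ignore_case:
            left, right = left.lower(), right.lower()
        # "=" is a partial match, "==" an exact match.
        result = (right in left) if op == "=" else (left == right)
    else:
        result = NUMERIC_OPS[op](float(actual), float(wanted))
    # A leading "!" inverts the outcome.
    return result != bool(parsed.group("neg"))


# Hypothetical row; field names follow the example on the slide.
row = {"NAMESPACE": "default", "REQ/S": 1500}
print(matches(row, "NAMESPACE==default"))
print(matches(row, "REQ/S>1000"))
```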
  40. htop features: Filter with the field values • Demo (https://asciinema.org/a/247181)
  41. htop features: Drill down • Namespace -> Tables • Table -> Regions • RegionServer -> Regions • Select the record (Namespace, Table or RegionServer) you want to drill down into and press the i key
  42. htop features: Drill down • Demo (https://asciinema.org/a/247182)
  43. htop internals • htop reads its metrics from the ClusterMetrics returned by Admin.getClusterMetrics() • It only needs to access the HBase Master • To add more metrics, we first need to add them to ClusterMetrics • JMX endpoints would provide more metrics, but reading them requires access to every RegionServer, which might cause scalability issues
  44. Current status of htop • Not committed yet and a work in progress • Building htop for HBase 2.x • The basic features have been implemented • The remaining tasks for htop • Some code refactoring • Adding some tests • Documentation
  45. htop in the future • Support branch-1 • Add more metrics so that we can see more information from htop • Response time metrics ASAP • The metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.) • System information like CPU usage and memory usage might be useful • Useful features from the Unix top command • Color mapping • Batch mode, etc.
  46. THANK YOU
  47. Q & A
