Supporting Apache HBase : Troubleshooting and Supportability Improvements

DataWorks Summit

HBase has been running in production in hundreds of clusters across the CDH/HDP customer base, and Cloudera/Hortonworks have supported it for many years. In this talk, based on our support experience, we introduce information that helps troubleshoot HBase clusters efficiently. First, we (Daisuke at Cloudera support) talk about typical log messages and web UI information that can be used for troubleshooting, especially for performance issues. Since their meanings have changed across versions, we also show the differences and improvements (e.g. HBASE-20232 for memstore flush, HBASE-16972 for slow scanner, HBASE-18469 for request counter, and HBASE-21207 for sorting in the web UI). We (Toshihiro at Cloudera, a former Hortonworks employee) also cover some new tools (e.g. HBASE-21926 Profiler Servlet, HBASE-11062 htop, etc.) that should be useful for performance troubleshooting.

Supporting Apache HBase
Troubleshooting and Supportability Improvements

2© Cloudera, Inc. All rights reserved.
Who we are
• Daisuke Kobayashi (d1ce_)
• Customer support at Cloudera since 2012, focusing on HDFS and HBase specifically
• Apache HBase contributor
• Toshihiro Suzuki (brfrn169)
• Apache HBase committer since 2018
• Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera
• Wrote and published a book on HBase for beginners, in Japanese

Supporting HBase
• Typical troubleshooting scenarios with HBase
• Fix performance degradation (slowness)
• Identify why a process crashed
• Fix inconsistencies

Agenda
• General approach to HBase performance issues with existing tools
• htop - Real-time monitoring tool for HBase

General approach to HBase performance issues with existing tools
(Logs and metrics are strictly aligned to HBase 2.1 (CDH 6.2))

Troubleshooting Performance Issues
• Performance issues are tough!
• Typical reasons
• “Hot Spot” Region
• Region with Non-Local Data
• Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk
• Stop the world with long GC pauses in RegionServers
• Slowness Due To High Processor Usage
• Network Saturation, etc.
• Source of truth
• Logs (a lot!)
• Metrics (a lot!)

Approach to Performance Troubleshooting
• Understanding the issue
• Top-down
• USE Method (specifically, focusing on U (Utilization) and S (Saturation) in this talk)
Source: https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools

Resources and Observability in RegionServer
• RPC System (Handlers / Queues)
• MemStore
• BlockCache
• HDFS Client

Resources and Observability in RegionServer
• RPC System (Handlers / Queues): Frequency of requests; RPC Processed Time, Queue Length & Time
• MemStore: Memstore Size; Flush Size; Frequency of flush; Frequency of blocking updates
• BlockCache: Cache Size; Cache Eviction Ratio
• HDFS Client: Flush Queue

RPC System Utilization & Saturation
• Number of RPC requests
• Incremented by one by each of the following actions at the RPC server level
• doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan
• Master webui: HBASE-21207 made the columns sortable!
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"totalRequestCount" : 167130,
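
As a rough illustration of reading the raw metric on this slide, the sketch below pulls totalRequestCount out of a JMX-style JSON document. It assumes the payload has the {"beans": [...]} shape served by the /jmx servlet. The helper name find_metric and the SAMPLE payload are hypothetical, and the sample only reuses the value shown on the slide.

```python
import json

SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"

def find_metric(jmx_text, bean_name, metric):
    """Return one metric from a JMX JSON dump, or None if it is absent."""
    for bean in json.loads(jmx_text).get("beans", []):
        if bean.get("name") == bean_name:
            return bean.get(metric)
    return None

# Hypothetical payload, shaped like the servlet output (assumption).
SAMPLE = """
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",
   "totalRequestCount": 167130}
]}
"""

total_requests = find_metric(SAMPLE, SERVER_BEAN, "totalRequestCount")
print(total_requests)
```

Since the counter only grows, the frequency of requests would come from sampling it twice and dividing the difference by the interval.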

RPC System Utilization & Saturation
• RPC queue length & request size
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"queueSize" : 619211,
"numCallsInGeneralQueue" : 5,
"numCallsInPriorityQueue" : 0,
• queueSize: Running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run.
• numCallsInGeneralQueue: Queue for normal handlers. The number of handlers is controlled by hbase.regionserver.handler.count
• numCallsInPriorityQueue: Queue for high priority handlers that deal with admin requests and system table operation requests. The number of handlers is controlled by hbase.regionserver.metahandler.count
RegionServer webui

RPC System Utilization & Saturation
• Number of processed/queued requests
• If queued > processed, time to check thread dump
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"ProcessCallTime_num_ops" : 10961,
"QueueCallTime_num_ops" : 10961,
Cloudera Manager chart:
select ipc_process_rate, ipc_queue_rate
where roleType = REGIONSERVER
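
The queued-versus-processed check can be sketched by comparing two samples of the IPC counters. This is only an illustration: the counter names come from the raw metrics on this slide, the first sample reuses the slide's values, and the second sample and both function names are made up.

```python
PROCESSED = "ProcessCallTime_num_ops"
QUEUED = "QueueCallTime_num_ops"

def interval_counts(before, after):
    """Return (queued, processed) call counts between two samples."""
    return after[QUEUED] - before[QUEUED], after[PROCESSED] - before[PROCESSED]

def needs_thread_dump(before, after):
    """True when more calls were queued than processed in the interval."""
    queued, processed = interval_counts(before, after)
    return queued > processed

first = {PROCESSED: 10961, QUEUED: 10961}   # values from the slide
second = {PROCESSED: 11200, QUEUED: 11950}  # hypothetical later sample

print(interval_counts(first, second), needs_thread_dump(first, second))
```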

RPC System Utilization & Saturation
• Observability Improvements
• In case of slowness on scan.next() call, the target region name was unknown in the past.
• HBASE-16972 improved the logging by adding ‘scandetails’.
Before:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
After:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
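
Because the payload after (responseTooSlow) is JSON, the slow region can be picked out mechanically. The sketch below assumes each log entry is on a single line. The function name is hypothetical and the sample line is an abbreviated version of the log shown on this slide.

```python
import json
import re

# Capture the JSON object that follows the "(responseTooSlow):" marker.
SLOW_RE = re.compile(r"\(responseTooSlow\): (\{.*\})\s*$")

def parse_slow_response(line):
    """Return the responseTooSlow payload as a dict, or None for other lines."""
    m = SLOW_RE.search(line)
    if m is None:
        return None
    return json.loads(m.group(1))

# Abbreviated sample: several fields from the original log line are omitted.
LINE = (
    "2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: "
    '(responseTooSlow): {"method":"Scan","processingtimems":30000,'
    '"queuetimems":0,"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)

entry = parse_slow_response(LINE)
print(entry["processingtimems"], entry.get("scandetails"))
```

Using entry.get("scandetails") keeps the sketch usable on logs from versions without HBASE-16972, where the key is simply missing.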

Memstore Utilization & Saturation
• Memstore size
RegionServer webui
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"memStoreSize" : 5372418924,
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions",
"Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903,
"Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0,
"Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
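
The per-region metric names on this slide encode the namespace, table and encoded region name, so regions can be ranked by memstore size with a small parser. This is a sketch under assumptions: the pattern is inferred from the three sample keys only, it assumes a 32-character hex encoded region name, and it can misparse a namespace whose name contains "_table_". The function name is hypothetical.

```python
import re

# Pattern inferred from the sample keys (assumption, not an HBase contract).
REGION_METRIC_RE = re.compile(
    r"^Namespace_(?P<namespace>.+?)_table_(?P<table>.+)_region_"
    r"(?P<region>[0-9a-f]{32})_metric_(?P<metric>\w+)$"
)

def largest_memstores(metrics, top=3):
    """Return (size, table, region) tuples, largest memstore first."""
    rows = []
    for key, value in metrics.items():
        m = REGION_METRIC_RE.match(key)
        if m and m.group("metric") == "memStoreSize":
            rows.append((value, m.group("table"), m.group("region")))
    return sorted(rows, reverse=True)[:top]

# Values copied from the raw metrics on the slide.
REGIONS_BEAN = {
    "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize": 18295903,
    "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize": 0,
    "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize": 4642160,
}

ranked = largest_memstores(REGIONS_BEAN)
for size, table, region in ranked:
    print(size, table, region)
```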

Memstore Utilization & Saturation
• Compare the total memstore size across RegionServers
Cloudera Manager chart:
select total_memstore_size_across_hregions
where roleType = REGIONSERVER
• Compare the memstore size across regions in a RegionServer
Cloudera Manager chart:
select memstore_size
where category = HREGION

Memstore Utilization & Saturation
• Log snippet where a flush finishes
2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
• dataSize: Cell’s data alone, key bytes and value bytes, that is going to be flushed. This can be allocated off-heap too.
• heapSize: Cell’s data on-heap along with its metadata and index (overhead of Java objects)
• currentSize: Cell’s data alone on-heap after the flush
• 3db6134cedc326474801068c3cb4f2a9: Encoded region name
• in 1625ms: How long did the flush take to complete?
• Frequency of flush (per hour)
# grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
81 2019-05-13 17
6 2019-05-13 18
113 2019-05-15 02
18 2019-05-15 04
27 2019-05-15 12
133 2019-05-15 19
5 2019-05-15 20
198 2019-05-15 22
91 2019-05-15 23
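
The same hourly count can be sketched in Python, which also makes it easy to keep the flush duration. The sketch assumes each log entry sits on one line. The function name is hypothetical, the first sample line is the flush log from this slide, and the other two lines are invented to exercise the code.

```python
import re
from collections import Counter

# Group 1: date and hour. Group 2: flush duration in milliseconds.
FLUSH_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):.*Finished flush of .* in (\d+)ms"
)

def flush_stats(lines):
    """Return (flush count per hour, slowest flush in ms)."""
    per_hour = Counter()
    slowest_ms = 0
    for line in lines:
        m = FLUSH_RE.match(line)
        if m:
            per_hour[m.group(1)] += 1
            slowest_ms = max(slowest_ms, int(m.group(2)))
    return per_hour, slowest_ms

LOG = [
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true",
    # Hypothetical lines below, not taken from a real log.
    "2019-04-13 01:40:02,118 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~1.00 MB/1048576 for 0123456789abcdef0123456789abcdef "
    "in 210ms, sequenceid=4301, compaction requested=false",
    "2019-04-13 02:05:11,004 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "some unrelated message",
]

per_hour, slowest_ms = flush_stats(LOG)
for hour, count in sorted(per_hour.items()):
    print(count, hour)
```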

Memstore Utilization & Saturation
• Indication of blocked updates due to high memstore utilization
• Global memstore > hbase.regionserver.global.memstore.size
2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
• The first line shows why updates were blocked, the second how long they were blocked, and the third that blocking finished
• A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
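
To see how much time a RegionServer spent with updates blocked, the durations in the MemStoreFlusher warning can be summed. This is a minimal sketch assuming one log entry per line. The function name is hypothetical, and the sample lines are shortened copies of the RegionServer log on this slide.

```python
import re

BLOCKED_RE = re.compile(r"Memstore is above high water mark and block (\d+)ms")

def total_blocked_ms(lines):
    """Sum the blocked durations reported by MemStoreFlusher warnings."""
    total = 0
    for line in lines:
        m = BLOCKED_RE.search(line)
        if m:
            total += int(m.group(1))
    return total

# Shortened sample lines (logger names abbreviated for readability).
LOG = [
    "2019-05-13 17:12:08,001 INFO MemStoreFlusher: Blocking updates: "
    "global memstore heapsize 403.0 M is >= blocking 403.0 M",
    "2019-05-13 17:12:10,809 WARN MemStoreFlusher: "
    "Memstore is above high water mark and block 2808ms",
    "2019-05-13 17:12:10,809 INFO MemStoreFlusher: Unblocking updates for server",
]

blocked_ms = total_blocked_ms(LOG)
print(blocked_ms)
```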

Blockcache Utilization & Saturation
• Current block cache usage
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheSize" : 406847872,
"blockCacheFreeSize" : 6291459,
• Cache eviction
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheEvictionCount" : 38257,
Raw metrics, RegionServer webui
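
The two block cache gauges can be turned into a single utilization figure. The sketch below assumes that blockCacheSize plus blockCacheFreeSize approximates the configured cache capacity, which is an assumption on our side rather than something the slide states. The function name is hypothetical and the inputs are the values from the raw metrics.

```python
def block_cache_utilization(size_bytes, free_bytes):
    """Return the used fraction of the block cache (0.0 when capacity is 0)."""
    capacity = size_bytes + free_bytes  # assumed capacity: used + free
    return size_bytes / capacity if capacity else 0.0

used_fraction = block_cache_utilization(406847872, 6291459)
print(f"{used_fraction:.1%}")
```

A cache that stays nearly full while blockCacheEvictionCount keeps climbing is the saturation signal to look for.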

Blockcache Utilization & Saturation
• Compare the free size across RegionServers
Cloudera Manager chart:
select block_cache_free_size
where roleType = REGIONSERVER
• Compare the evicted blocks ratio across RegionServers
Cloudera Manager chart:
select block_cache_evicted_rate
where roleType = REGIONSERVER

HDFS Client Utilization & Saturation
• Flush queue size
RegionServer webui
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"flushQueueLength" : 0,
Cloudera Manager chart:
select flush_queue_size
where roleType = REGIONSERVER

htop – Real-Time Monitoring Tool for HBase

htop overview
• HBASE-11062 htop
• Work in Progress!
• Unix top-like tool
• Real-time monitoring for HBase metrics

htop motivation
• HBase UIs
• The metrics of the moment
• Can't see the metrics in time series
• Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana)
• The metrics in time series
• Collecting the latest metrics takes a little time
• htop
• Real-time monitoring
• A lot of features for real-time monitoring

htop motivation
Capability | HBase UI | Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics | htop
Metrics of the Moment | ○ | △ | ○
Metrics in Time Series | ☓ | ○ | ☓
Real-Time Monitoring | △ | △ | ○

htop features
htop screen
• Command to start htop:
• $ hbase top
• Similar to Unix top command
• The metrics are refreshed periodically, every 3 seconds by default
• Vertical and Horizontal scrolling

htop features
htop screen
• Demo (https://asciinema.org/a/247434)

htop features
Change refresh delay
• Press the d key and enter a new refresh delay
• We can also change the default refresh delay by specifying a command line argument:
• ex) $ hbase top -delay 2 # sets the default refresh delay to 2 seconds

htop features
Change refresh delay
• Demo (https://asciinema.org/a/247447)

htop features
Metrics per Namespace/Table/RegionServer/Region
• Press the m key and choose a mode
• Namespace mode
• metrics per Namespace
• Table mode
• metrics per Table
• RegionServer mode
• metrics per RegionServer
• Region mode (default)
• metrics per Region
• We can also change the default mode by specifying a command line argument:
• ex) $ hbase top -mode n # starts in Namespace mode

htop features
Metrics per Namespace/Table/RegionServer/Region
• Demo (https://asciinema.org/a/247177)

htop features
Choose displayed fields and change the order of fields
• Press the f key and choose the displayed fields (by pressing the space key)
• We can also change the order of the fields in the same screen
• The Right key selects a field to move, then the Left key or Enter key commits

htop features
Choose displayed fields and change the order of fields
• Demo (https://asciinema.org/a/247306)

htop features
Sort the metrics by the field values
• Press the f key and choose a sort field (by pressing the s key)
• Switch between descending and ascending order by pressing the R key
• Demo (https://asciinema.org/a/247180)

htop features
Filter with the field values
• ex) NAMESPACE==default, REQ/S>1000
• Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, !
• o key: Add a case-insensitive filter
• O key: Add a case-sensitive filter
• ctrl + o key: Show current filters
• = key: Clear current filters
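
To make the operator semantics concrete, here is a small sketch of how such filter expressions could be evaluated against one record. It is not htop's implementation. The function name and the TABLE field are hypothetical, and treating a leading ! as negation is our assumption, since the slide only lists ! among the operators.

```python
import re

# Optional leading "!", a field name, an operator, then the wanted value.
FILTER_RE = re.compile(r"^(!?)([^=<>!]+)(==|>=|<=|=|>|<)(.*)$")

def matches(record, expr, ignore_case=True):
    """Evaluate one filter expression against a dict of field values."""
    m = FILTER_RE.match(expr.strip())
    if m is None:
        raise ValueError(f"cannot parse filter: {expr!r}")
    negate, field, op, wanted = m.groups()
    actual = record[field.strip()]
    if op in ("=", "=="):
        left, right = str(actual), wanted.strip()
        if ignore_case:
            left, right = left.lower(), right.lower()
        # "=" is a partial (substring) match, "==" an exact match.
        result = right in left if op == "=" else left == right
    else:
        a, b = float(actual), float(wanted)
        result = {">": a > b, ">=": a >= b, "<": a < b, "<=": a <= b}[op]
    # Assumption: a leading "!" inverts the result.
    return result != bool(negate)

record = {"NAMESPACE": "default", "TABLE": "cluster_test", "REQ/S": 1500}
print(matches(record, "NAMESPACE==default"), matches(record, "REQ/S>1000"))
```

The ignore_case flag mirrors the difference between the o key and the O key.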

htop features
Filter with the field values
• Demo (https://asciinema.org/a/247181)

htop features
Drill down
• Select the record (Namespace, Table or RegionServer) you want to drill down into and press the i key
• Namespace -> Tables
• Table -> Regions
• RegionServer -> Regions

htop features
Drill down
• Demo (https://asciinema.org/a/247182)

htop internals
• htop gets the metrics from ClusterMetrics, which is returned by Admin.getClusterMetrics()
• It needs to access only the HBase Master
• To add more metrics, we first need to add them to ClusterMetrics
• JMX endpoints would provide more metrics, but htop would need to access all RegionServers, which might cause scalability issues

Current status of htop
• Not committed yet and a work in progress
• Building htop for HBase 2.x
• The basic features have been implemented
• The remaining tasks for htop
• Some code refactoring
• Adding some tests
• Documentation

htop in the future
• Support branch-1
• Add more metrics so that we can see more information from htop
• Response time metrics ASAP
• The metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.)
• System information like CPU usage and memory usage might be useful
• Useful features in Unix top command
• Color mapping
• Batch mode, etc.

THANK YOU

Q & A

More Related Content

What's hot

Tame that Beast

Tame that Beast

Tame that BeastDataWorks Summit/Hadoop Summit

Ingest and Stream Processing - What will you choose?

Ingest and Stream Processing - What will you choose?

Ingest and Stream Processing - What will you choose?DataWorks Summit/Hadoop Summit

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit

Managing Hadoop, HBase and Storm Clusters at Yahoo Scale

Managing Hadoop, HBase and Storm Clusters at Yahoo Scale

Managing Hadoop, HBase and Storm Clusters at Yahoo ScaleDataWorks Summit/Hadoop Summit

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...Michael Stack

HBaseConAsia2018 Track1-3: HBase at Xiaomi

HBaseConAsia2018 Track1-3: HBase at Xiaomi

HBaseConAsia2018 Track1-3: HBase at XiaomiMichael Stack

Taming the Elephant: Efficient and Effective Apache Hadoop Management

Taming the Elephant: Efficient and Effective Apache Hadoop Management

Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit

HDFS Tiered Storage: Mounting Object Stores in HDFS

HDFS Tiered Storage: Mounting Object Stores in HDFS

HDFS Tiered Storage: Mounting Object Stores in HDFSDataWorks Summit

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...Michael Stack

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...DataWorks Summit

Empower Data-Driven Organizations

Empower Data-Driven Organizations

Empower Data-Driven OrganizationsDataWorks Summit/Hadoop Summit

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...Michael Stack

Curb your insecurity with HDP

Curb your insecurity with HDP

Curb your insecurity with HDPDataWorks Summit/Hadoop Summit

HBaseConAsia2018 Track3-6: HBase at Meituan

HBaseConAsia2018 Track3-6: HBase at Meituan

HBaseConAsia2018 Track3-6: HBase at MeituanMichael Stack

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

Disaster Recovery and Cloud Migration for your Apache Hive WarehouseDataWorks Summit

Dancing elephants - efficiently working with object stores from Apache Spark ...

Dancing elephants - efficiently working with object stores from Apache Spark ...

Dancing elephants - efficiently working with object stores from Apache Spark ...DataWorks Summit

HDFS Analysis for Small Files

HDFS Analysis for Small Files

HDFS Analysis for Small FilesDataWorks Summit/Hadoop Summit

To The Cloud and Back: A Look At Hybrid Analytics

To The Cloud and Back: A Look At Hybrid Analytics

To The Cloud and Back: A Look At Hybrid AnalyticsDataWorks Summit/Hadoop Summit

Practice of large Hadoop cluster in China Mobile

Practice of large Hadoop cluster in China Mobile

Practice of large Hadoop cluster in China MobileDataWorks Summit

How T-Mobile Tamed Metron

How T-Mobile Tamed Metron

How T-Mobile Tamed MetronDataWorks Summit

What's hot (20)

Tame that Beast

Tame that Beast

Tame that Beast

Ingest and Stream Processing - What will you choose?

Ingest and Stream Processing - What will you choose?

Ingest and Stream Processing - What will you choose?

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Floating on a RAFT: HBase Durability with Apache Ratis

Managing Hadoop, HBase and Storm Clusters at Yahoo Scale

Managing Hadoop, HBase and Storm Clusters at Yahoo Scale

Managing Hadoop, HBase and Storm Clusters at Yahoo Scale

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...

HBaseConAsia2018 Track3-7: The application of HBase in New Energy Vehicle Mon...

HBaseConAsia2018 Track1-3: HBase at Xiaomi

HBaseConAsia2018 Track1-3: HBase at Xiaomi

HBaseConAsia2018 Track1-3: HBase at Xiaomi

Taming the Elephant: Efficient and Effective Apache Hadoop Management

Taming the Elephant: Efficient and Effective Apache Hadoop Management

Taming the Elephant: Efficient and Effective Apache Hadoop Management

HDFS Tiered Storage: Mounting Object Stores in HDFS

HDFS Tiered Storage: Mounting Object Stores in HDFS

HDFS Tiered Storage: Mounting Object Stores in HDFS

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...

HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...

Startup Case Study: Leveraging the Broad Hadoop Ecosystem to Develop World-Fi...

Empower Data-Driven Organizations

Empower Data-Driven Organizations

Empower Data-Driven Organizations

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...

HBaseConAsia2018 Track2-3: Bringing MySQL Compatibility to HBase using Databa...

Curb your insecurity with HDP

Curb your insecurity with HDP

Curb your insecurity with HDP

HBaseConAsia2018 Track3-6: HBase at Meituan

HBaseConAsia2018 Track3-6: HBase at Meituan

HBaseConAsia2018 Track3-6: HBase at Meituan

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

Disaster Recovery and Cloud Migration for your Apache Hive Warehouse

Dancing elephants - efficiently working with object stores from Apache Spark ...

Dancing elephants - efficiently working with object stores from Apache Spark ...

Dancing elephants - efficiently working with object stores from Apache Spark ...

HDFS Analysis for Small Files

HDFS Analysis for Small Files

HDFS Analysis for Small Files

To The Cloud and Back: A Look At Hybrid Analytics

To The Cloud and Back: A Look At Hybrid Analytics

To The Cloud and Back: A Look At Hybrid Analytics

Practice of large Hadoop cluster in China Mobile

Practice of large Hadoop cluster in China Mobile

Practice of large Hadoop cluster in China Mobile

How T-Mobile Tamed Metron

How T-Mobile Tamed Metron

How T-Mobile Tamed Metron

Similar to Supporting Apache HBase : Troubleshooting and Supportability Improvements

HBase tales from the trenches

HBase tales from the trenches

HBase tales from the trencheswchevreuil

Operating and supporting HBase Clusters

Operating and supporting HBase Clusters

Operating and supporting HBase Clustersenissoz

Operating and Supporting Apache HBase Best Practices and Improvements

Operating and Supporting Apache HBase Best Practices and Improvements

Operating and Supporting Apache HBase Best Practices and ImprovementsDataWorks Summit/Hadoop Summit

Share point 2013’s distributed cache service 6.0 (1)

Share point 2013’s distributed cache service 6.0 (1)

Share point 2013’s distributed cache service 6.0 (1)Hexaware Technologies

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaCloudera, Inc.

HDFS: Optimization, Stabilization and Supportability

HDFS: Optimization, Stabilization and Supportability

HDFS: Optimization, Stabilization and SupportabilityDataWorks Summit/Hadoop Summit

Hdfs 2016-hadoop-summit-dublin-v1

Hdfs 2016-hadoop-summit-dublin-v1

Hdfs 2016-hadoop-summit-dublin-v1Chris Nauroth

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7MapR Technologies

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7Ted Dunning

Hdfs 2016-hadoop-summit-san-jose-v4

Hdfs 2016-hadoop-summit-san-jose-v4

Hdfs 2016-hadoop-summit-san-jose-v4Chris Nauroth

HBase New Features

HBase New Features

HBase New Featuresrxu

HBase BackupsHBaseCon

Tales from the Cloudera Field

Tales from the Cloudera Field

Tales from the Cloudera FieldHBaseCon

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...Amazon Web Services

Hbase Backups: Backups in the Enterprise

Hbase Backups: Backups in the Enterprise

Hbase Backups: Backups in the EnterpriseSalesforce Engineering

Clug 2011 March web server optimisation

Clug 2011 March web server optimisation

Clug 2011 March web server optimisationgrooverdan

Fast SQL on Hadoop, really?

Fast SQL on Hadoop, really?

Fast SQL on Hadoop, really?DataWorks Summit

Web Speed And Scalability

Web Speed And Scalability

Web Speed And ScalabilityJason Ragsdale

HBaseCon 2015: HBase 2.0 and Beyond Panel

HBaseCon 2015: HBase 2.0 and Beyond Panel

HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsightHBaseCon

Similar to Supporting Apache HBase : Troubleshooting and Supportability Improvements (20)

HBase tales from the trenches

HBase tales from the trenches

HBase tales from the trenches

Operating and supporting HBase Clusters

Operating and supporting HBase Clusters

Operating and supporting HBase Clusters

Operating and Supporting Apache HBase Best Practices and Improvements

Operating and Supporting Apache HBase Best Practices and Improvements

Operating and Supporting Apache HBase Best Practices and Improvements

Share point 2013’s distributed cache service 6.0 (1)

Share point 2013’s distributed cache service 6.0 (1)

Share point 2013’s distributed cache service 6.0 (1)

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera

HDFS: Optimization, Stabilization and Supportability

HDFS: Optimization, Stabilization and Supportability

HDFS: Optimization, Stabilization and Supportability

Hdfs 2016-hadoop-summit-dublin-v1

Hdfs 2016-hadoop-summit-dublin-v1

Hdfs 2016-hadoop-summit-dublin-v1

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7

Inside MapR's M7

Hdfs 2016-hadoop-summit-san-jose-v4

Hdfs 2016-hadoop-summit-san-jose-v4

Hdfs 2016-hadoop-summit-san-jose-v4

HBase New Features

HBase New Features

HBase New Features

HBase Backups

Tales from the Cloudera Field

Tales from the Cloudera Field

Tales from the Cloudera Field

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...

AWS re:Invent 2016: Amazon CloudFront Flash Talks: Best Practices on Configur...

Hbase Backups: Backups in the Enterprise

Hbase Backups: Backups in the Enterprise

Hbase Backups: Backups in the Enterprise

Clug 2011 March web server optimisation

Clug 2011 March web server optimisation

Clug 2011 March web server optimisation

Fast SQL on Hadoop, really?

Fast SQL on Hadoop, really?

Fast SQL on Hadoop, really?

Web Speed And Scalability

Web Speed And Scalability

Web Speed And Scalability

HBaseCon 2015: HBase 2.0 and Beyond Panel

HBaseCon 2015: HBase 2.0 and Beyond Panel

HBaseCon 2015: HBase 2.0 and Beyond Panel

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

Optimizing Apache HBase for Cloud Storage in Microsoft Azure HDInsight

More from DataWorks Summit

Data Science Crash Course

Data Science Crash Course

Data Science Crash CourseDataWorks Summit

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...

Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit

Managing the Dewey Decimal System

Managing the Dewey Decimal System

Managing the Dewey Decimal SystemDataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at Uber

HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix

Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi

Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability Improvements

Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit

Security Framework for Multitenant Architecture

Security Framework for Multitenant Architecture

Security Framework for Multitenant ArchitectureDataWorks Summit

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything Engine

Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...

Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit

Extending Twitter's Data Platform to Google Cloud

Supporting Apache HBase : Troubleshooting and Supportability Improvements

1. Supporting Apache HBase: Troubleshooting and Supportability Improvements

2. Who we are
• Daisuke Kobayashi (d1ce_)
• Customer support at Cloudera since 2012, focusing on HDFS and HBase specifically
• Apache HBase contributor
• Toshihiro Suzuki (brfrn169)
• Apache HBase committer since 2018
• Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera
• Wrote and published a book on HBase for beginners in Japanese

3. Supporting HBase
• Typical Troubleshooting Scenario with HBase
• Fix performance degradation (Slowness)
• Identify the reason a process crashed
• Fix inconsistencies

4. Agenda
• General approach to HBase performance issues with existing tools
• htop - Real-time monitoring tool for HBase

5. General approach to HBase performance issues with existing tools
(Logs and metrics are strictly aligned to HBase 2.1 (CDH 6.2))

6. Troubleshooting Performance Issues
• Performance issues are tough!
• Typical reasons
• “Hot Spot” Region
• Region with Non-Local Data
• Excessive I/O Wait Due To Swapping Or An Over-Worked Or Failing Hard Disk
• Stop the world with long GC pauses in RegionServers
• Slowness Due To High Processor Usage
• Network Saturation, etc.
• Source of truth
• Logs (a lot!)
• Metrics (a lot!)

7. Approach to Performance Troubleshooting
• Understanding the issue
• Top-down
• USE Method (specifically, focusing on U and S in this talk)
Source - https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools

8. Resources and Observability in RegionServer
[Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

9. Resources and Observability in RegionServer
[Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

10. Resources and Observability in RegionServer
• RPC System (Handlers / Queues): Frequency of requests; RPC Processed Time, Queue Length & Time
• MemStore: Memstore Size; Flush Size; Frequency of flush; Frequency of blocking updates
• BlockCache: Cache Size; Cache Eviction Ratio
• HDFS Client: Flush Queue

11. [Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

12. RPC System Utilization & Saturation
• Number of RPC requests
• Incremented by one by the following actions at the RPC server level
• doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"totalRequestCount" : 167130,
Master webui: HBASE-21207 made the columns sortable!
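To compare `totalRequestCount` across RegionServers outside the webui, the raw metric can be read from each server's JMX JSON output. The following is a minimal sketch, assuming the payload has the usual top-level `beans` list served by the `/jmx` endpoint; the host names and the second server's count are made up for illustration.

```python
import json

SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"


def total_request_count(jmx_payload):
    """Return totalRequestCount from a parsed /jmx payload, or None if absent."""
    for bean in jmx_payload.get("beans", []):
        if bean.get("name") == SERVER_BEAN:
            return bean.get("totalRequestCount")
    return None


def busiest_server(counts_by_host):
    """Return the host with the highest request count."""
    return max(counts_by_host, key=counts_by_host.get)


# Hypothetical payloads for two RegionServers (host names are made up).
payloads = {
    "rs1.example.com": json.loads(
        '{"beans": [{"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",'
        ' "totalRequestCount": 167130}]}'
    ),
    "rs2.example.com": json.loads(
        '{"beans": [{"name": "Hadoop:service=HBase,name=RegionServer,sub=Server",'
        ' "totalRequestCount": 41200}]}'
    ),
}
counts = {host: total_request_count(p) for host, p in payloads.items()}
print(busiest_server(counts))
```

In practice the payloads would be fetched from each RegionServer's info port; the comparison logic stays the same.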

13. RPC System Utilization & Saturation
• RPC queue length & request size
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"queueSize" : 619211,
"numCallsInGeneralQueue" : 5,
"numCallsInPriorityQueue" : 0,
• queueSize: Running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run.
• numCallsInGeneralQueue: Queue for normal handlers. The number of handlers is controlled by hbase.regionserver.handler.count
• numCallsInPriorityQueue: Queue for high priority handlers that deal with admin requests and system table operation requests. The number of handlers is controlled by hbase.regionserver.metahandler.count
[RegionServer webui screenshot]

14. RPC System Utilization & Saturation
• Number of processed/queued requests
• If queued > processed, time to check thread dump
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"ProcessCallTime_num_ops" : 10961,
"QueueCallTime_num_ops" : 10961,
Cloudera Manager chart: select ipc_process_rate, ipc_queue_rate where roleType = REGIONSERVER
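Both `*_num_ops` values are cumulative counters, so they only become comparable over time once converted to rates. A minimal sketch of that conversion and of the slide's "queued > processed" rule of thumb, assuming two samples taken a known number of seconds apart; the second sample's values are made up.

```python
def rate_per_second(prev_count, curr_count, interval_seconds):
    """Convert two readings of a cumulative counter into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return (curr_count - prev_count) / interval_seconds


def handlers_falling_behind(prev, curr, interval_seconds):
    """True when the queued rate exceeds the processed rate over the interval."""
    processed = rate_per_second(
        prev["ProcessCallTime_num_ops"], curr["ProcessCallTime_num_ops"], interval_seconds
    )
    queued = rate_per_second(
        prev["QueueCallTime_num_ops"], curr["QueueCallTime_num_ops"], interval_seconds
    )
    return queued > processed


# First sample taken from the slide; the second one is hypothetical.
sample_t0 = {"ProcessCallTime_num_ops": 10961, "QueueCallTime_num_ops": 10961}
sample_t1 = {"ProcessCallTime_num_ops": 11561, "QueueCallTime_num_ops": 11861}
print(handlers_falling_behind(sample_t0, sample_t1, 60))
```

If this returns True for several consecutive intervals, that is the point at which the slide suggests taking a thread dump.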

15. RPC System Utilization & Saturation
• Observability Improvements
• In case of slowness on scan.next() call, the target region name was unknown in the past.
• HBASE-16972 improved the logging by adding ‘scandetails’.
Before HBASE-16972:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
After HBASE-16972:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
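Because the payload after `(responseTooSlow):` is JSON, the slow scan's table and region can be pulled out of a RegionServer log mechanically. A minimal sketch, assuming the line layout shown on the slide; the sample line is shortened from the slide's second example, and the helper names are illustrative only.

```python
import json
import re

# Captures the JSON object that follows the "(responseTooSlow): " marker.
RESPONSE_TOO_SLOW = re.compile(r"\(responseTooSlow\): (\{.*\})\s*$")
SCAN_DETAILS = re.compile(r"table: (\S+) region: (\S+)")


def parse_response_too_slow(line):
    """Return the JSON payload of a responseTooSlow line as a dict, else None."""
    match = RESPONSE_TOO_SLOW.search(line)
    return json.loads(match.group(1)) if match else None


def scan_target(entry):
    """Return (table, region) from the 'scandetails' field, or None if missing."""
    match = SCAN_DETAILS.match(entry.get("scandetails", ""))
    return match.groups() if match else None


# Shortened version of the slide's log line (some fields omitted).
line = (
    "2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: "
    '(responseTooSlow): {"method":"Scan","processingtimems":30000,'
    '"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer",'
    '"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)
entry = parse_response_too_slow(line)
print(entry["processingtimems"], scan_target(entry))
```

On versions without HBASE-16972 the `scandetails` key is absent, so `scan_target` returns None rather than failing.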

16. [Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

17. Memstore Utilization & Saturation
• Memstore size
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"memStoreSize" : 5372418924,
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions",
"Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903,
"Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0,
"Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
[RegionServer webui screenshot]

18. Memstore Utilization & Saturation
• Compare the total memstore size across RegionServers
Cloudera Manager chart: select total_memstore_size_across_hregions where roleType = REGIONSERVER
• Compare memstore size across regions in a RegionServer
Cloudera Manager chart: select memstore_size where category = HREGION

19. Memstore Utilization & Saturation
• Log snippet where a flush finishes
2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
• dataSize: Cell’s data alone, key bytes and value bytes, that is going to be flushed. This can be allocated off-heap too.
• heapSize: Cell’s data on-heap along with its metadata and index (overhead of Java objects)
• currentSize: Cell’s data alone on-heap after the flush
• 3db6134cedc326474801068c3cb4f2a9: Encoded region name
• in 1625ms: How long did the flush take to complete?
• Frequency of flush (per hour)
# grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
81 2019-05-13 17
6 2019-05-13 18
113 2019-05-15 02
18 2019-05-15 04
27 2019-05-15 12
133 2019-05-15 19
5 2019-05-15 20
198 2019-05-15 22
91 2019-05-15 23
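The per-hour flush count from the `grep` pipeline can also be produced in a script, which makes it easy to keep the region name and flush duration alongside it. A minimal sketch, assuming the "Finished flush of" line layout shown on the slide; the second sample line is hypothetical.

```python
import re
from collections import Counter

# Captures: date and hour, encoded region name, flush duration in milliseconds.
FLUSH = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} .*"
    r"Finished flush of .* for (\w+) in (\d+)ms"
)


def parse_flush(line):
    """Return (hour, region, duration_ms) for a flush-finished line, else None."""
    match = FLUSH.match(line)
    if not match:
        return None
    hour, region, duration = match.groups()
    return hour, region, int(duration)


def flushes_per_hour(lines):
    """Count flush-finished lines per 'YYYY-MM-DD HH' bucket."""
    parsed = (parse_flush(line) for line in lines)
    return Counter(p[0] for p in parsed if p)


log_lines = [
    # Line from the slide.
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true",
    # Hypothetical second flush in the same hour.
    "2019-04-13 01:41:02,118 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~98.10 MB/102865305, heapSize ~98.30 MB/103075021, "
    "currentSize=1.10 MB/1153433 for 7cdc92fd59a4f1a96b431552d952560c in 940ms, "
    "sequenceid=4301, compaction requested=false",
    "2019-04-13 01:41:03,000 INFO some.other.Logger: unrelated line",
]
print(flushes_per_hour(log_lines))
```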

20. Memstore Utilization & Saturation
• Indication of blocked updates due to high memstore utilization
• Global memstore > hbase.regionserver.global.memstore.size
• A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
Why were updates blocked?
2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
How long was it blocked?
2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
Blocking updates finished
2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
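The two blocking conditions above reduce to simple arithmetic on configuration values. A minimal sketch, assuming hbase.regionserver.global.memstore.size is expressed as a fraction of the RegionServer heap; the heap size, fraction, flush size, and multiplier used below are illustrative inputs, not values read from the cluster on the slide.

```python
MB = 1024 * 1024


def region_blocking_limit(flush_size_bytes, block_multiplier):
    """Per-region limit: block.multiplier * memstore.flush.size."""
    return flush_size_bytes * block_multiplier


def global_blocking_limit(max_heap_bytes, global_memstore_fraction):
    """Server-wide limit: heap size times the global memstore fraction (assumed)."""
    return int(max_heap_bytes * global_memstore_fraction)


def updates_blocked(memstore_bytes, limit_bytes):
    """True once the memstore has reached the blocking limit."""
    return memstore_bytes >= limit_bytes


# Illustrative settings only.
region_limit = region_blocking_limit(flush_size_bytes=128 * MB, block_multiplier=4)
global_limit = global_blocking_limit(max_heap_bytes=1024 * MB, global_memstore_fraction=0.4)
print(region_limit // MB, global_limit // MB)
```

Comparing these computed limits with the figures printed in the "Blocking updates" and "Over memstore limit" messages is a quick way to confirm which setting was hit.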

21. [Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

22. Blockcache Utilization & Saturation
• Current block cache usage
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheSize" : 406847872,
"blockCacheFreeSize" : 6291459,
• Cache eviction
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheEvictionCount" : 38257,
[RegionServer webui screenshot]
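A single utilization figure is often easier to compare across RegionServers than the two raw sizes. A minimal sketch, assuming the cache capacity equals `blockCacheSize + blockCacheFreeSize`; the inputs are the values shown on the slide.

```python
def block_cache_utilization(block_cache_size, block_cache_free_size):
    """Fraction of the block cache in use, assuming capacity = used + free."""
    capacity = block_cache_size + block_cache_free_size
    if capacity == 0:
        return 0.0
    return block_cache_size / capacity


utilization = block_cache_utilization(406847872, 6291459)
print(f"{utilization:.1%}")
```

A cache that stays close to full while `blockCacheEvictionCount` keeps climbing is worth a closer look.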

23. Blockcache Utilization & Saturation
• Compare the free size across RegionServers
Cloudera Manager chart: select block_cache_free_size where roleType = REGIONSERVER
• Compare the evicted blocks ratio across RegionServers
Cloudera Manager chart: select block_cache_evicted_rate where roleType = REGIONSERVER

24. [Diagram: RPC System (Handlers / Queues), MemStore, BlockCache, HDFS Client]

25. HDFS Client Utilization & Saturation
• Flush queue size
Raw metrics:
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"flushQueueLength" : 0,
Cloudera Manager chart: select flush_queue_size where roleType = REGIONSERVER
[RegionServer webui screenshot]

26. htop – Real-Time Monitoring Tool for HBase

27. htop overview
• HBASE-11062 htop
• Work in Progress!
• Unix top-like tool
• Real-time monitoring for HBase metrics

28. htop motivation
• HBase UIs
• The metrics of the moment
• Can't see the metrics in time series
• Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana)
• The metrics in time series
• Collecting the latest metrics takes a little bit of time
• htop
• Real-time monitoring
• A lot of features for real-time monitoring

29. htop motivation
Capability | HBase UI | Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics | htop
Metrics of the Moment | ○ | △ | ○
Metrics in Time Series | ☓ | ○ | ☓
Real-Time Monitoring | △ | △ | ○

30. htop features: htop screen
• Command to start htop:
• $ hbase top
• Similar to Unix top command
• The metrics are refreshed at a fixed interval (3 seconds by default)
• Vertical and Horizontal scrolling

31. htop features: htop screen
• Demo (https://asciinema.org/a/247434)

32. htop features: Change refresh delay
• Press d key and enter a new refresh delay
• We can also change the default refresh delay by specifying a command line argument:
• ex) $ hbase top -delay 2 # the default refresh delay is 2 seconds

33. htop features: Change refresh delay
• Demo (https://asciinema.org/a/247447)

34. htop features: Metrics per Namespace/Table/RegionServer/Region
• Press m key and choose mode
• Namespace mode
• metrics per Namespace
• Table mode
• metrics per Table
• RegionServer mode
• metrics per RegionServer
• Region mode (default)
• metrics per Region
• We can also change the default mode by specifying a command line argument:
• ex) $ hbase top -mode n # the default mode is Namespace mode

35. htop features: Metrics per Namespace/Table/RegionServer/Region
• Demo (https://asciinema.org/a/247177)

36. htop features: Choose displayed fields and change the order of fields
• Press f key and choose displayed fields (by pressing space key)
• We can also change the order of the fields in the same screen
• Right key selects for move, then Left key or Enter key commits

37. htop features: Choose displayed fields and change the order of fields
• Demo (https://asciinema.org/a/247306)

38. htop features: Sort the metrics by the field values
• Press f key and choose a sort field (by pressing s key)
• Switch to the descending/ascending order by pressing R key
• Demo (https://asciinema.org/a/247180)

39. htop features: Filter with the field values
• ex) NAMESPACE==default, REQ/S>1000
• Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, !
• o key: Add a case-insensitive filter
• O key: Add a case-sensitive filter
• ctrl + o key: Show current filters
• = key: Clear current filters
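The filter semantics listed above can be illustrated with a small evaluator. This is only a sketch of the described behaviour, not htop's implementation: it assumes `!` is a leading negation (as in Unix top), that `=` and `==` compare values as strings, and that the ordering operators compare numbers. The function names and the sample record are made up.

```python
import re

# Optional leading "!", field name, operator, value.
FILTER = re.compile(r"^(!?)([^=<>!]+?)(==|>=|<=|=|>|<)(.*)$")


def parse_filter(expr, ignore_case=True):
    """Split a filter such as 'REQ/S>1000' into its parts."""
    match = FILTER.match(expr.strip())
    if not match:
        raise ValueError(f"not a valid filter: {expr!r}")
    negate, field, op, value = match.groups()
    return bool(negate), field.strip(), op, value.strip(), ignore_case


def matches(record, flt):
    """Evaluate a parsed filter against one record (a dict of field values)."""
    negate, field, op, value, ignore_case = flt
    actual = record[field]
    if op in ("=", "=="):
        left, right = str(actual), value
        if ignore_case:
            left, right = left.lower(), right.lower()
        result = (right in left) if op == "=" else (left == right)
    else:
        left, right = float(actual), float(value)
        result = {">": left > right, ">=": left >= right,
                  "<": left < right, "<=": left <= right}[op]
    return result != negate


record = {"NAMESPACE": "default", "REQ/S": 1500}  # hypothetical row
print(matches(record, parse_filter("NAMESPACE==default")),
      matches(record, parse_filter("REQ/S>1000")))
```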

40. htop features: Filter with the field values
• Demo (https://asciinema.org/a/247181)

41. htop features: Drill down
• Namespace -> Tables
• Table -> Regions
• RegionServer -> Regions
• Select a record (Namespace, Table or RegionServer) you want to drill down into and press i key

42. htop features: Drill down
• Demo (https://asciinema.org/a/247182)

43. htop internals
• htop gets the metrics from ClusterMetrics via Admin.getClusterMetrics()
• It needs to access only HBase Master
• If we add more metrics, we first need to add them to ClusterMetrics
• JMX endpoints would provide more metrics, but htop would then need to access all RegionServers, which might cause scalability issues

44. Current status of htop
• Not committed yet and a work in progress
• Building htop for HBase 2.x
• The basic features have been implemented
• The remaining tasks for htop
• Some code refactoring
• Adding some tests
• Documentation

45. htop in the future
• Support branch-1
• Add more metrics so that we can see more information from htop
• Response time metrics ASAP
• The metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.)
• System information like CPU usage and memory usage might be useful
• Useful features in Unix top command
• Color mapping
• Batch mode, etc.

47. Q & A

Editor's Notes

First of all, let us introduce ourselves. My name is Daisuke Kobayashi. My teammates call me just Dice, or DiceK as a nickname. I have been working at Cloudera, based in Japan, since 2012. I'm currently working in backline support, helping customers and internal support folks resolve complicated issues. I'm also an HBase contributor. Hello, my name is Toshihiro Suzuki. I have been an HBase committer since last year, and I'm a Sr. Software Engineer, Breakfix, in the Support team at Cloudera. I mainly handle HBase/Phoenix and HDFS cases. I have written and published a book on HBase for beginners in Japanese.
So what does supporting HBase mean at Cloudera? At Cloudera, we have a big HBase user base, and cluster sizes vary widely, from 10 nodes to 100 and 1000 nodes. Customers report various types of issues to our support team every single day, and our job is simple: fix the issue and answer their questions. If I summarize the problems reported by customers, these are the typical scenarios we see: fixing performance degradation, identifying why a process crashed, and fixing inconsistencies, which is a well-known issue in both HBase 1 and 2. In this talk, we will focus specifically on the first one.
From my side, I'm gonna introduce the general approach to performance issues and show the existing tools we usually use for HBase troubleshooting. Later on, my colleague Toshi will talk about a new tool he's now developing. It's more intuitive and efficient for troubleshooting in real time.
So, fixing performance issues is tough. The number of nodes differs across customers, and they run different versions with different configurations, different types of datasets, and different use cases. Various factors can lead to performance issues: misconfigurations in HBase, or unbalanced load on regionservers, also known as hot spots, caused by bad schema design. Also, all regionservers should be collocated with datanodes; if a particular region's block doesn't exist on the local datanode, the regionserver has to read the data remotely from other datanodes. Apart from that, there might be bad OS configuration, GC issues, hardware failures, or network-related issues. Another thing that makes these issues difficult to troubleshoot is the sheer amount of information exposed through logs and metrics about how the HBase cluster performs. Whenever we analyze a problem, we have to pick the right log snippets and metrics and correlate them to the root cause. To take advantage of the logs and metrics, we obviously need to understand what they actually mean, why they are logged, and when a particular metric is incremented. It's also important to understand what they are not. For core HBase developers, these questions may be easy to answer, but HBase is widespread and used by many users across various industries. Over the last couple of years, I have been asked about the meaning of given metrics and log snippets over and over. So the aim of my talk is to share this basic information to help others narrow down problems and dig further.
So, to start performance troubleshooting, I think these are the typical and important approaches. First off, we need to listen to customers in order to understand what they are complaining about, what they are hitting, and what they wanna resolve. This is the very first step, and an important one, to get on the same page with them. To narrow down performance issues, in general we should look at the system with a top-down approach. Specifically in HBase, we first look at the cluster itself and see how resource usage is distributed across nodes. If something looks wrong on particular nodes, we need to dig into those nodes. Throughout the troubleshooting steps, I like using the USE method, which was originally defined by Brendan Gregg, who is at Netflix and was formerly at Sun. The USE method is designed like an emergency checklist in a flight manual, so it's intended to be simple, straightforward, complete, and fast. USE stands for Utilization, Saturation, and Errors. Utilization asks how busy a particular resource is. Saturation can be measured as the length of a wait queue, or the time spent waiting on the queue. Errors are explicit indications of something going wrong. Obviously the USE method is not perfect, but it can be used as the very first checklist to identify the bottleneck as quickly as possible. So, the next question is: what are the resources in HBase? As you know, the RegionServer is the worker role and is responsible for processing read and write requests.
These are the typical resources in a single regionserver.
All user requests come into the RPC system first, where they are queued and processed by handlers concurrently. For caching, a request goes to the memstore for writes or the block cache for reads. The data is persisted to HDFS under certain conditions. As you know, requests always flow in the direction of the orange arrow, which means we should always follow this path when checking resources.
So what types of information are exposed by each resource? For example, the RPC system exposes the number of requests and how many requests are queued and processed. The memstore exposes the memstore size, the size of the flushed memstore, and the frequency of flushes. Using these observability items, we can check how each resource is utilized and saturated. In the next slides, let's walk through each resource one by one.
First, the RPC system
From this slide, I'm gonna show you the metrics, webui, and logs that are used for troubleshooting. Please note that all of these are aligned to the HBase 2.1 code base, more specifically CDH 6.2. As I mentioned, the RPC system is the place where all client requests arrive, so we should be able to check how many requests are received by every single regionserver. Here in the gray area, I'm showing the raw metric that is exposed via the JMX endpoint on a particular regionserver. The total request count is also exposed through the Master and regionserver webui. We can simply compare the requests across regionservers. If there's a value that stands out, that's a chance to narrow down to the particular regionserver. If you have been managing HBase and are familiar with these webuis, you may be aware that the columns in the table are now sortable. This is a simple but powerful change. We often have a screen-sharing session with a customer to see the issue in real time. Every time we looked at these webuis, it was difficult to figure out the highest or lowest servers without doing something tricky, so this sorting functionality should make our lives easier. This number is incremented by various types of request calls at the RPC server level, as described in the slide.
Next, to understand saturation, the number of requests queued at a particular point in time is exposed. That is what I'm showing in the gray area as raw metrics, with the corresponding values in the webui below. As the meta table is usually accessed more frequently than others, its queue is isolated from the queue for normal regions. If the queue size is constantly growing, it may indicate that something is going wrong in processing the requests.
We can check how many requests are processed and queued so far by the RPC system. I’m showing the raw metric value in the gray area. Since it’s just an incremental value, Cloudera Manager converts this value into rate, which make it easy to understand how things are going over time. Ideally, both processed and queued should be same. The processed is the blue graph and the queued is the green one in this example. We can see both exactly matches since as things are going well. If the queued becomes bigger than processed, it’s the sign of RPC handlers getting slow with some reason. We should check the thread dump to dig into further
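The counter-to-rate conversion mentioned above can be sketched in a few lines of Python. This is only an assumption about how such a conversion might look, not how Cloudera Manager implements it; the sample values and the function names `to_rates` and `lagging_intervals` are hypothetical.

```python
def to_rates(samples):
    """Convert (timestamp_seconds, cumulative_count) samples to per-second rates.

    Each rate is labelled with the timestamp at the end of its interval.
    Intervals with a non-increasing timestamp or a counter reset are skipped.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        if t1 <= t0 or c1 < c0:
            continue
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def lagging_intervals(queued, processed):
    """Return timestamps where the queued rate is higher than the processed rate.

    Assumes both series were sampled at the same timestamps.
    """
    processed_by_time = dict(to_rates(processed))
    return [
        t for t, rate in to_rates(queued)
        if t in processed_by_time and rate > processed_by_time[t]
    ]


# Hypothetical cumulative counters sampled every 60 seconds.
queued = [(0, 1000), (60, 7000), (120, 13600)]
processed = [(0, 1000), (60, 7000), (120, 13000)]
print(to_rates(queued))
print(lagging_intervals(queued, processed))
```

In the first interval both rates match, which is the healthy case; in the second the queued rate pulls ahead of the processed rate, which is the point where a thread dump is worth taking.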
If the RPC system takes longer than 10 seconds to respond back for a given request, it informs the table and the region name in the process logs. However, in case of scan next call is slow, none of the target region name or row key was informed so we were really frustrated while troubleshooting. Fortunately, recent version gets this improved by logging the scan details as I'm showing with green makrer in the second example. With this hint, we should be able to narrow down to the particular region to see why it’s slow.
Alright, next let’s take a look at memstore.
Memstore utilization is exposed via several levels, from server, tables, and regions. Here I'm showing the server and the region level raw metrics along with the corresponding webui. I think it’s fairly easy to understand the memstore utilization
When using Cloudera Manager, we typically use this sort of query to compare the total memstore utilization across regionservers. The upper graph shows that. We can also check whether there’s any outstanding region that uses more memstore than other regions in a single regionserver, which is in the lower graph.
Flush persists data in the memstore into the underlying HDFS, which means the memstore is fully utilized, or most likely saturated. This is an example of a log snippet written when a flush finishes. In HBase 2, data can be allocated off-heap for both read and write. Given this, the log reports the pure key-value data size and the on-heap occupation separately. It also shows how long the flush took. These numbers should be informative to see how a particular flush goes. If it takes longer, it may be time to look at the HDFS performance too. Using this granular logging of flushes, we can see the frequency of flush activity on a regionserver. In this example, I'm grouping the output on an hourly basis.
If the total memstore size across regions in a single regionserver goes beyond the global memstore size limit, all updates are blocked by the regionserver until the utilization drops below the threshold. This is a typical log message in HBase 2.1. There are three lines that correlate with each other. The first line indicates that blocking updates started because the global memstore size became greater than the blocking threshold. The second line shows how long it took, and the third line indicates that blocking completed. In the second example, the client gets the RegionTooBusyException for the particular region. This is because this region has too big a memstore which is not flushed yet. This is also a typical indication of saturation regarding the specific memstore.
In the context of block cache, utilization is simply the cache usage, which is available via raw metrics and also via the web UI. If blocks are evicted from the cache, in general, it means the cache is saturated. I’m showing the raw metrics on the left hand side and the corresponding web UI information on the right hand side. From the top, it’s indicating how much of the block cache is used, how much memory remains for the cache, and the number of evicted blocks.
Using Cloudera Manager, we can check the eviction rate, which is converted from the raw metric value. I’m showing an example in the graph below. If the utilization is high enough, but the eviction rate is also high, it’s a sign that the block cache size is too small to handle the current workload appropriately. So it's time to think about increasing the cache size.
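The hourly grouping can be done with any text tool; here is a minimal Python sketch of the idea. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS,mmm` timestamp and that a completed flush can be recognized by a marker substring. The marker default and the sample lines below are abbreviated stand-ins, not verbatim HBase log output, so adjust both to what your regionserver log actually prints.

```python
from collections import Counter


def flushes_per_hour(lines, marker="Finished flush"):
    """Count log lines containing marker, grouped by 'YYYY-MM-DD HH'.

    Assumes every matching line begins with a 'YYYY-MM-DD HH:MM:SS' timestamp,
    so the first 13 characters identify the hour.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:13]] += 1
    return dict(sorted(counts.items()))


# Abbreviated, made-up lines that only mimic the assumed layout.
sample_lines = [
    "2019-03-01 10:05:12,001 INFO ... Finished flush ...",
    "2019-03-01 10:40:03,482 INFO ... Finished flush ...",
    "2019-03-01 10:41:00,000 INFO ... some other message",
    "2019-03-01 11:02:47,913 INFO ... Finished flush ...",
]
print(flushes_per_hour(sample_lines))
```

A sudden jump in the per-hour count is a hint that the memstore is being saturated more often than usual on that regionserver.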
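The "high utilization plus high eviction rate" rule of thumb can be written down as a small check. This is a sketch only: the function name, its parameters, and both thresholds are hypothetical choices for illustration, and sensible values depend on the workload.

```python
def cache_looks_undersized(size_bytes, free_bytes, evictions_per_sec,
                           min_utilization=0.9, max_eviction_rate=100.0):
    """Return True when the cache is nearly full and still evicting heavily.

    size_bytes and free_bytes are the used and remaining block cache memory;
    evictions_per_sec is a rate derived from the cumulative eviction counter.
    Both thresholds are illustrative assumptions, not HBase defaults.
    """
    capacity = size_bytes + free_bytes
    if capacity == 0:
        return False
    utilization = size_bytes / capacity
    return utilization >= min_utilization and evictions_per_sec > max_eviction_rate


# Hypothetical readings: 95% full and evicting 500 blocks per second.
print(cache_looks_undersized(950, 50, 500.0))
```

A nearly full cache on its own is normal; it is the combination with a sustained eviction rate that suggests the working set does not fit.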
Alright, I’m gonna quickly cover the last resource in the picture. The HDFS resource utilization and saturation are basically tracked at the HDFS level metrics and logs. So I can't talk much in this session, but I am gonna show one related metric exposed at the HBase level.
That’s the flush queue size. When flushing a memstore, the flush is queued first and persisted to HDFS later. The queue is maintained at the regionserver level and exposed as a metric through the web UI. It’s visible through a Cloudera Manager chart as well. Typically, the queue shouldn’t grow, so if it is constantly growing, it denotes that flushes are failing or getting slow for some reason. So it's time to look at the HDFS side. That’s pretty much all I have prepared for this presentation. Alright, I have been talking about how to look at the resources in HBase and their utilization and saturation, mainly from metrics and sometimes from logs. I’m pretty sure that I couldn’t cover everything. We have to look further using a different approach if we can’t find anything bad with this one, but I hope you could get some ideas from my talk. Next, Toshi is gonna give a presentation about a new tool which should make our life better.
From my side, I’m going to talk about htop that’s a Real-Time Monitoring Tool for HBase.
So, an overview of htop. htop is the tool I’m developing now, which is tracked in the JIRA ticket HBASE-11062. This is a Unix top-like tool, and we can do real-time monitoring of the HBase metrics with it.
And, the motivation of htop. As Dice mentioned, a first approach when we are facing performance issues is to check the current status of the cluster. At this time, we can look at the HBase UIs to check the metrics. They show the metrics of the moment, but we can't see them in time series there. If you want to see the metrics in time series, we have Ganglia, OpenTSDB, Cloudera Manager and Ambari Metrics. In Ambari Metrics, we can see the metrics via Grafana. They are useful when we want to see the metrics in time series, but if you're going to do real-time monitoring, they are not very useful because collecting the latest metrics takes a little bit of time in those tools. For real-time monitoring, I have started to develop htop. I’ll explain the features of htop later in this talk.
To clarify the position of htop, I made this matrix of the features of those tools. If you just want to see the metrics of the moment, you can use any of them. However, in Ganglia, OpenTSDB, Cloudera Manager and Ambari Metrics, collecting the latest metrics takes a little bit of time. If you want to see the metrics in time series, you need to use Ganglia, OpenTSDB, Cloudera Manager or Ambari Metrics. And if you want to do real-time monitoring, htop is the most useful of them as it has a lot of features to do that.
From here, I will talk about the features of htop with demonstrations. Firstly, about the htop screen. We can start htop by running the hbase top command. The UI is similar to the Unix top command. The metrics are refreshed at a certain interval, 3 seconds by default. And you can do vertical and horizontal scrolling.
I’ll show you a demo of the htop screen. Actually, this is not a live demo, but a terminal recording. And we can see this demo anytime at this URL. To start htop, run the hbase top command. This is the screen of htop. The metrics in this screen are refreshed every 3 seconds. It consists of 2 parts, the Summary part and the Metrics part. In the Summary part, you can see the HBase version, cluster ID, the number of region servers, the region count, Average Cluster Load and aggregated Request count per second. In the Metrics part, you can see the metrics. In this case, you can see the metrics per region, and it shows the namespace name, table name, encoded region name, RegionServer name, Request count per second, read request count per second and so on. You can scroll down to see all metrics like this. You can also do horizontal scrolling like this.
As mentioned, the refresh delay is 3 seconds by default. But you can change it by pressing ‘d’ key and put the new refresh delay. And we can also change the default refresh delay by specifying a command line argument “-delay”
I’ll show you the demo of it. If you press the ‘d’ key in the htop screen, you can enter a new refresh delay. In this demo, I’m trying to change it to 1 second. Yeah, it has been changed.
And next. Currently, htop can show the metrics per Namespace, Table, RegionServer and Region. And they are called respectively Namespace mode, Table mode, RegionServer mode and Region mode. The default is region mode. We can change this mode by pressing ‘m’ key in htop screen. And we can also change the default mode by specifying a command line argument “-mode”
So, I’ll show you demo of it. Now, you see the metrics per region, and we can change it to Namespace or Table or RegionServer by pressing ‘m’ key. For example, we can see the metrics per Namespace like this or you can also see the metrics per Table like this.
In addition to that, we can choose which fields are displayed in the screen. By pressing ‘f’ key, you can choose displayed fields. We can also change the order of fields in the same screen.
I’ll show you the demo of it. By pressing the ‘f’ key, we move to this screen where you can choose the displayed fields. For now, in region mode, these fields here can be displayed. For example, if you don’t need the Namespace and Table fields, and if you need the Region name field, then you can remove and add these fields like this. And as you can see, the fields are removed and added. Also, we can change the order of fields in the same screen. Go back to the screen by pressing the ‘f’ key, select the field you want to move, and press the Right key. Then move the field to wherever you want it and press the Left key. So you can see the order of the fields is changed.
It’s also possible to sort the metrics by the field values. And we can switch between descending and ascending order by pressing the ‘R’ key. I’ll show you a demo of it. Press the ‘f’ key to move to the previous screen. You can also choose a sort field on the same screen. If you want to sort the metrics by “Request count per second,” choose the field and press the ‘s’ key. So the current sort field is changed to “Request count per second.” And then you can see the metrics sorted by the field.
So next is the Filter feature, which is very important. For example, if you want to see the metrics of the “default” Namespace only, you can specify this filter: NAMESPACE==default. Or if you want to see the metrics that have more than 1000 requests per second, then you can specify a filter like this: REQ/S>1000. In this Filter feature, we can use the general operators like the ones shown on the slide. When we press the o key in the htop screen, we can add a filter that ignores case. When we press the O key, we can add a case-sensitive filter. Also, when we press ctrl + o, we can see the current filters. And, when we press the = key, we can clear the current filters.
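To make the filter semantics concrete, here is a minimal Python sketch of how filters such as NAMESPACE==default and REQ/S>1000 could be parsed and applied to rows of metrics. It is an illustration under assumptions, not htop's actual implementation: it supports only the two operators that appear in the examples above, and the record layout and the function names `parse_filter` and `matches` are hypothetical.

```python
import re

# Only "==" and ">" are handled, because those are the operators in the examples.
_FILTER = re.compile(r"^\s*(?P<field>[^=<>\s]+)\s*(?P<op>==|>)\s*(?P<value>.+?)\s*$")


def parse_filter(text):
    """Split a filter string into a (field, operator, value) tuple."""
    m = _FILTER.match(text)
    if m is None:
        raise ValueError(f"unsupported filter: {text!r}")
    return m["field"], m["op"], m["value"]


def matches(record, flt, ignore_case=True):
    """Return True if the record (a dict of field -> value) satisfies the filter."""
    field, op, value = flt
    actual = record[field]
    if op == ">":
        return float(actual) > float(value)
    left, right = str(actual), value
    if ignore_case:
        left, right = left.lower(), right.lower()
    return left == right


# Hypothetical rows, one per region, using the field names from the examples.
records = [
    {"NAMESPACE": "default", "TABLE": "test", "REQ/S": 1500},
    {"NAMESPACE": "default", "TABLE": "other", "REQ/S": 20},
    {"NAMESPACE": "hbase", "TABLE": "meta", "REQ/S": 3000},
]
filters = [parse_filter("NAMESPACE==default"), parse_filter("REQ/S>1000")]
selected = [r for r in records if all(matches(r, f) for f in filters)]
print(selected)
```

Each added filter narrows the result further, and the `ignore_case` flag mirrors the difference between adding a filter with the o key and with the O key.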
Let me show you demo of it. If you want to see the metrics in “default” namespace only, press ’o’ key and you can specify a filter like this. As you can see, only the metrics in “default” Namespace are shown now. And, if you want to see the metrics of the ”test” table only, press ’o’ key again and you can add a filter like this. So now only the metrics in “default” Namespace and “test” table are shown. Furthermore, if you want the metrics that have more than 1000 requests, then you can add a filter like this. So, we can see only the metrics more than 1000 requests. We can see the specified filters by pressing ctrl + ‘o‘ key like this. These are the current filters. We can clear the current filters by pressing ‘=’ key like this. The filters are cleared.
The last feature I’d like to introduce here is the drill-down feature. We can drill down from a Namespace to its Tables, from a Table to its Regions, or from a RegionServer to its Regions. With this feature, we can find the “Hot Spot” region easily. We can drill down by selecting the record we want to drill down into and pressing the i key.
I’ll show you demo of it. If you want to drill down the “default” namespace to the tables, you can move to the namespace mode and select the “default” namespace and then press ‘i’ key. So you can see the metrics for the tables in the “default” namespace. Furthermore, if you want to drill down from the “test” table to the regions, select “test” table and press ‘i’ key, so you can see the metrics for the regions of the “test” table. Similarly, you can drill down from a RegionServer to regions. Move to the RegionServer mode and select one of the RegionServers and press ‘i’ key. So you can see the metrics for the regions on the selected RegionServer. That’s it for the demonstrations of the features of htop.
Next, let me talk about the internals of htop. Currently, htop gets the metrics from the ClusterMetrics class, obtained via the Admin.getClusterMetrics method, because that only needs to access the HBase Master. So if we add more metrics to htop, we first need to add more metrics to the ClusterMetrics class. Actually, the JMX endpoints would give us more metrics, but that requires accessing all RegionServers, which might cause scalability issues. So I decided not to use JMX endpoints for htop.
In this slide, I’ll talk about the current status of htop. As mentioned, htop hasn’t been committed yet, and it’s a work in progress actually. However, the basic features have been implemented as I showed you in the demonstrations. The remaining tasks for it are some code refactoring and adding some tests. I also need to make documentation for it. Maybe, it will be ready for review next month, and once the review is passed, it will be committed.
And, htop in the future. Currently, I’m developing this tool for the master branch and branch-2. So as a next step, we need to support branch-1. And we should add more metrics so that we can see more information from htop. Especially, adding response time metrics is required because they are very important for performance troubleshooting. And we can add the metrics per Column Family, User and Operation like GET, PUT, SCAN. And I’m thinking about adding system information like CPU usage and memory usage, which might be useful. In addition to that, we can add the useful features in Unix top command like Color mappings or Batch mode.
That’s all from my side. We hope this presentation was informative for you. Thank you very much.
We have a few minutes for Q & A. Any Questions?