Supporting Apache HBase
Troubleshooting and Supportability Improvements
2© Cloudera, Inc. All rights reserved.
Who we are
• Daisuke Kobayashi (d1ce_)
  • Customer support at Cloudera since 2012, focusing specifically on HDFS and HBase
  • Apache HBase contributor
• Toshihiro Suzuki (brfrn169)
  • Apache HBase committer since 2018
  • Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera
  • Wrote and published a Japanese-language book on HBase for beginners

Supporting HBase
• Typical troubleshooting scenarios with HBase
  • Fix performance degradation (slowness)
  • Identify why a process crashed
  • Fix inconsistencies

Agenda
• General approach to HBase performance issues with existing tools
• htop - Real-time monitoring tool for HBase

General approach to HBase performance issues with existing tools
(Logs and metrics are strictly aligned to HBase 2.1 (CDH 6.2))

Troubleshooting Performance Issues
• Performance issues are tough!
• Typical reasons
  • "Hot spot" region
  • Region with non-local data
  • Excessive I/O wait due to swapping or an over-worked or failing hard disk
  • Stop-the-world long GC pauses in RegionServers
  • Slowness due to high processor usage
  • Network saturation, etc.
• Source of truth
  • Logs (a lot!)
  • Metrics (a lot!)

Approach to Performance Troubleshooting
• Understanding the issue
• Top-down
• USE Method (Utilization, Saturation, Errors; this talk focuses specifically on U and S)
Source: https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools

Resources and Observability in RegionServer
• RPC System (Handlers / Queues)
• MemStore
• BlockCache
• HDFS Client

Resources and Observability in RegionServer
• RPC System (Handlers / Queues): Frequency of requests; RPC Processed Time, Queue Length & Time
• MemStore: Memstore Size, Flush Size, Frequency of flush, Frequency of blocking updates
• BlockCache: Cache Size, Cache Eviction Ratio
• HDFS Client: Flush Queue

RPC System Utilization & Saturation
• Number of RPC requests
  • Incremented by one by each of the following actions at the RPC server level:
    • doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan
Master webui (HBASE-21207 made the columns sortable!)
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"totalRequestCount" : 167130,
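Because totalRequestCount is a running counter, turning it into a request rate takes two samples. The following is a minimal Python sketch, assuming the RegionServer /jmx servlet returns a JSON document with a top-level "beans" array; the function names and the 60-second interval in the usage note are hypothetical and not part of HBase.

```python
import json

SERVER_BEAN = "Hadoop:service=HBase,name=RegionServer,sub=Server"


def total_request_count(jmx_text):
    """Return totalRequestCount from the sub=Server bean of a /jmx JSON dump."""
    for bean in json.loads(jmx_text)["beans"]:  # "beans" layout is an assumption
        if bean.get("name") == SERVER_BEAN:
            return bean["totalRequestCount"]
    raise KeyError("sub=Server bean not found")


def requests_per_second(earlier, later, interval_s):
    """Average request rate between two samples taken interval_s seconds apart."""
    return (later - earlier) / interval_s


# A tiny stand-in for a real /jmx response, using the value from the slide.
SAMPLE_JMX = json.dumps(
    {"beans": [{"name": SERVER_BEAN, "totalRequestCount": 167130}]}
)
```

For example, if a second sample taken 60 seconds later reported 167730, `requests_per_second(167130, 167730, 60)` would give 10 requests per second for that RegionServer.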

RPC System Utilization & Saturation
• RPC queue length & request size
RegionServer webui
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"queueSize" : 619211,
"numCallsInGeneralQueue" : 5,
"numCallsInPriorityQueue" : 0,
• queueSize: running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run.
• numCallsInGeneralQueue: queue for normal handlers. The number of handlers is controlled by hbase.regionserver.handler.count
• numCallsInPriorityQueue: queue for high-priority handlers, which deal with admin requests and system table operation requests. The number of handlers is controlled by hbase.regionserver.metahandler.count

RPC System Utilization & Saturation
• Number of processed/queued requests
• If queued > processed, it is time to check a thread dump
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
"ProcessCallTime_num_ops" : 10961,
"QueueCallTime_num_ops" : 10961,
Cloudera Manager chart:
select ipc_process_rate, ipc_queue_rate
where roleType = REGIONSERVER
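The "queued > processed" rule can be checked from two samples of the sub=IPC bean. This is a minimal sketch, assuming each sample is a dict holding the two counters shown above; the function name and the sample values other than 10961 are hypothetical.

```python
def backlog_growth(prev, curr):
    """Return how many more calls were queued than processed between samples.

    A positive result means the queued count grew faster than the processed
    count over the interval, which is the cue to take a thread dump.
    """
    queued = curr["QueueCallTime_num_ops"] - prev["QueueCallTime_num_ops"]
    processed = curr["ProcessCallTime_num_ops"] - prev["ProcessCallTime_num_ops"]
    return queued - processed


# First sample uses the values from the slide; the second one is made up.
PREV = {"ProcessCallTime_num_ops": 10961, "QueueCallTime_num_ops": 10961}
CURR = {"ProcessCallTime_num_ops": 11100, "QueueCallTime_num_ops": 11200}

if backlog_growth(PREV, CURR) > 0:
    print("queued > processed: consider taking a thread dump")
```

Comparing deltas rather than absolute values keeps the check meaningful on a RegionServer that has been running for a long time.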

RPC System Utilization & Saturation
• Observability Improvements
  • In the past, when a scan.next() call was slow, the target region name was unknown.
  • HBASE-16972 improved the logging by adding ‘scandetails’.
Before:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
After:
2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
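With ‘scandetails’ in place, slow scans can be attributed to a region by parsing the log. The sketch below assumes the text after "(responseTooSlow): " is a single JSON object on one line, as in the log above; the regular expressions and the function name are illustrative and are not an HBase API.

```python
import json
import re

TOO_SLOW = re.compile(r"\(responseTooSlow\): (\{.*\})\s*$")
SCAN_DETAILS = re.compile(r"table: (\S+) region: (\S+)")


def parse_too_slow(line):
    """Extract timing and, if present, table/region from a responseTooSlow line."""
    match = TOO_SLOW.search(line)
    if match is None:
        return None
    payload = json.loads(match.group(1))
    result = {
        "method": payload.get("method"),
        "processingtimems": payload.get("processingtimems"),
        "queuetimems": payload.get("queuetimems"),
        "table": None,
        "region": None,
    }
    details = SCAN_DETAILS.search(payload.get("scandetails", ""))
    if details:
        result["table"], result["region"] = details.groups()
    return result


# The log line from the slide, joined back into a single line.
SAMPLE_LINE = (
    "2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): "
    '{"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)",'
    '"starttimems":1553110361981,"responsesize":63,"method":"Scan",'
    '"param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false '
    'next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>",'
    '"processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer",'
    '"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)
```

Grouping the parsed results by region would then show whether slow scans concentrate on a few regions.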

Memstore Utilization & Saturation
• Memstore size
RegionServer webui
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"memStoreSize" : 5372418924,
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions",
"Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903,
"Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0,
"Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
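The per-region keys in the sub=Regions bean can be ranked to find the regions holding the most memstore data. This is a sketch under the assumption that keys follow the pattern visible above (Namespace_<ns>_table_<table>_region_<32 hex chars>_metric_memStoreSize); the helper name is hypothetical.

```python
import re

KEY = re.compile(
    r"^Namespace_(?P<ns>.+?)_table_(?P<table>.+)"
    r"_region_(?P<region>[0-9a-f]{32})_metric_memStoreSize$"
)


def memstore_by_region(bean):
    """Return (table, encoded_region, bytes) tuples, largest memstore first."""
    rows = []
    for key, value in bean.items():
        match = KEY.match(key)
        if match:
            rows.append((match["table"], match["region"], value))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# The three entries shown in the raw metrics above.
REGIONS_BEAN = {
    "name": "Hadoop:service=HBase,name=RegionServer,sub=Regions",
    "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize": 18295903,
    "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize": 0,
    "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize": 4642160,
}

for table, region, size in memstore_by_region(REGIONS_BEAN):
    print(f"{size:>12} {table} {region}")
```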

Memstore Utilization & Saturation
• Compare the total memstore size across RegionServers
Cloudera Manager chart:
select total_memstore_size_across_hregions
where roleType = REGIONSERVER
• Compare memstore size across regions in a RegionServer
Cloudera Manager chart:
select memstore_size
where category = HREGION

Memstore Utilization & Saturation
• Log snippet where a flush finishes
2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
  • dataSize: Cell’s data alone, key bytes and value bytes, that is going to be flushed. This can be allocated off-heap too.
  • heapSize: Cell’s data on-heap along with its metadata and index (overhead of Java objects)
  • currentSize: Cell’s data alone on-heap after the flush
  • 3db6134cedc326474801068c3cb4f2a9: Encoded region name
  • in 1625ms: How long did the flush take to complete?
• Frequency of flush (per hour)
# grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
81 2019-05-13 17
6 2019-05-13 18
113 2019-05-15 02
18 2019-05-15 04
27 2019-05-15 12
133 2019-05-15 19
5 2019-05-15 20
198 2019-05-15 22
91 2019-05-15 23
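The same per-hour count can be produced in Python, which also makes it easy to pull the flushed size and duration out of each line. This is a minimal sketch, assuming the "Finished flush of" line format shown above; the regular expression and function names are illustrative only.

```python
import re
from collections import Counter

FLUSH = re.compile(
    r"^(?P<hour>\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3} .*"
    r"Finished flush of dataSize ~[\d.]+ \w+/(?P<bytes>\d+),.* for "
    r"(?P<region>[0-9a-f]{32}) in (?P<ms>\d+)ms"
)


def parse_flush(line):
    """Return hour, region, flushed bytes and duration for one flush line."""
    match = FLUSH.search(line)
    if match is None:
        return None
    return {
        "hour": match["hour"],
        "region": match["region"],
        "data_bytes": int(match["bytes"]),
        "duration_ms": int(match["ms"]),
    }


def flushes_per_hour(lines):
    """Count finished flushes per 'YYYY-MM-DD HH' bucket."""
    counts = Counter()
    for line in lines:
        parsed = parse_flush(line)
        if parsed:
            counts[parsed["hour"]] += 1
    return counts


# The flush line from the slide, joined back into a single line.
SAMPLE_FLUSH = (
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true"
)
```

To run it against a real log, something like `flushes_per_hour(open(path))` would do, where `path` is the RegionServer log file.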

Memstore Utilization & Saturation
• Indication of blocked updates due to high memstore utilization
  • Global memstore > hbase.regionserver.global.memstore.size
2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
(Why were updates blocked?)
2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
(How long was it blocked?)
2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
(Blocking updates finished)
  • A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
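The two blocking conditions come down to simple arithmetic. The sketch below works them through with hypothetical settings: a 32 MB flush size with a multiplier of 4 (chosen only because the product matches the 128.0M limit in the exception above, not because that cluster is known to use them), and a 1 GiB heap with a global memstore fraction of 0.4. It also assumes hbase.regionserver.global.memstore.size is a fraction of the RegionServer heap.

```python
MB = 1024 * 1024


def region_blocking_limit(flush_size_bytes, block_multiplier):
    """Per-region limit: block.multiplier * memstore.flush.size."""
    return flush_size_bytes * block_multiplier


def global_blocking_limit(heap_bytes, global_memstore_fraction):
    """Global limit: heap size * global.memstore.size (a fraction of the heap)."""
    return int(heap_bytes * global_memstore_fraction)


# Hypothetical settings, for illustration only.
per_region = region_blocking_limit(32 * MB, 4)
global_limit = global_blocking_limit(1024 * MB, 0.4)

print(f"per-region blocking limit: {per_region / MB:.1f} MB")
print(f"global blocking limit:     {global_limit / MB:.1f} MB")
```

Comparing these limits with the memstore sizes reported in the metrics shows how close a region or a RegionServer is to blocking updates.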

Blockcache Utilization & Saturation
RegionServer webui
Raw metrics
• Current block cache usage
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheSize" : 406847872,
"blockCacheFreeSize" : 6291459,
• Cache eviction
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"blockCacheEvictionCount" : 38257,
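These raw values can be turned into a utilization figure and an eviction rate. This is a minimal sketch, assuming blockCacheSize plus blockCacheFreeSize equals the cache capacity and that blockCacheEvictionCount is a running counter; the function names and the second eviction sample in the test are hypothetical.

```python
def block_cache_utilization(size_bytes, free_bytes):
    """Fraction of the block cache in use (assumes capacity = size + free)."""
    return size_bytes / (size_bytes + free_bytes)


def evictions_per_second(earlier_count, later_count, interval_s):
    """Average eviction rate between two samples of blockCacheEvictionCount."""
    return (later_count - earlier_count) / interval_s


# Values from the raw metrics above.
utilization = block_cache_utilization(406847872, 6291459)
print(f"block cache utilization: {utilization:.1%}")
```

A cache that stays nearly full while the eviction count keeps climbing suggests that the working set does not fit in the block cache.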

Blockcache Utilization & Saturation
• Compare the free size across RegionServers
Cloudera Manager chart:
select block_cache_free_size
where roleType = REGIONSERVER
• Compare the rate of evicted blocks across RegionServers
Cloudera Manager chart:
select block_cache_evicted_rate
where roleType = REGIONSERVER

HDFS Client Utilization & Saturation
• Flush queue size
RegionServer webui
Raw metrics
"name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
"flushQueueLength" : 0,
Cloudera Manager chart:
select flush_queue_size
where roleType = REGIONSERVER

htop – Real-Time Monitoring Tool for HBase

htop overview
• HBASE-11062 htop
• Work in Progress!
• Unix top-like tool
• Real-time monitoring for HBase metrics

htop motivation
• HBase UIs
  • The metrics of the moment
  • Can't see the metrics in time series
• Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana)
  • The metrics in time series
  • Collecting the latest metrics takes a little time
• htop
  • Real-time monitoring
  • A lot of features for real-time monitoring

htop motivation
Capability | HBase UI | Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics | htop
Metrics of the Moment | ○ | △ | ○
Metrics in Time Series | ☓ | ○ | ☓
Real-Time Monitoring | △ | △ | ○

htop features
htop screen
• Command to start htop:
  • $ hbase top
• Similar to the Unix top command
• The metrics are refreshed at a fixed interval (3 seconds by default)
• Vertical and horizontal scrolling
• Demo (https://asciinema.org/a/247434)

htop features
Change refresh delay
• Press the d key and enter a new refresh delay
• We can also change the default refresh delay by specifying a command line argument:
  • ex) $ hbase top -delay 2 # sets the default refresh delay to 2 seconds
• Demo (https://asciinema.org/a/247447)

htop features
Metrics per Namespace/Table/RegionServer/Region
• Press the m key and choose a mode
  • Namespace mode
    • metrics per Namespace
  • Table mode
    • metrics per Table
  • RegionServer mode
    • metrics per RegionServer
  • Region mode (default)
    • metrics per Region
• We can also change the default mode by specifying a command line argument:
  • ex) $ hbase top -mode n # sets the default mode to Namespace mode
• Demo (https://asciinema.org/a/247177)

htop features
Choose displayed fields and change the order of fields
• Press the f key and choose the displayed fields (by pressing the space key)
• We can also change the order of the fields on the same screen
  • The Right key selects a field to move; the Left key or Enter key commits
• Demo (https://asciinema.org/a/247306)

htop features
Sort the metrics by the field values
• Press the f key and choose a sort field (by pressing the s key)
• Switch between descending and ascending order by pressing the R key
• Demo (https://asciinema.org/a/247180)

htop features
Filter with the field values
• ex) NAMESPACE==default, REQ/S>1000
• Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, !
• o key: Add a case-insensitive filter
• O key: Add a case-sensitive filter
• ctrl + o key: Show current filters
• = key: Clear current filters
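To make the operator semantics concrete, here is a small Python sketch of how such a filter expression could be evaluated against one record. It is not htop's implementation: the grammar (an optional leading ! that negates the expression, as in Unix top), the function name and the sample record are all assumptions for illustration.

```python
import re

# Hypothetical grammar: optional "!" prefix, field name, operator, value.
FILTER = re.compile(
    r"^(?P<neg>!)?(?P<field>[^=<>!]+?)(?P<op>==|>=|<=|=|>|<)(?P<value>.+)$"
)
NUMERIC_OPS = {
    ">": lambda a, b: a > b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
    "<=": lambda a, b: a <= b,
}


def matches(record, expr, ignore_case=True):
    """Return True if the record satisfies one filter expression."""
    parsed = FILTER.match(expr.strip())
    if parsed is None:
        raise ValueError(f"cannot parse filter: {expr!r}")
    field, op, value = parsed["field"].strip(), parsed["op"], parsed["value"].strip()
    actual = record[field]
    if op in NUMERIC_OPS:
        hit = NUMERIC_OPS[op](float(actual), float(value))
    else:
        left, right = str(actual), value
        if ignore_case:  # "o" key behaviour; pass False for the "O" key
            left, right = left.lower(), right.lower()
        # "=" is a partial match, "==" is an exact match.
        hit = (right in left) if op == "=" else (left == right)
    return not hit if parsed["neg"] else hit


# A made-up record with the two fields used in the slide's example.
RECORD = {"NAMESPACE": "default", "REQ/S": 1500}
```

With this record, both example filters from the slide (`NAMESPACE==default` and `REQ/S>1000`) match, while `!NAMESPACE==default` does not.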
• Demo (https://asciinema.org/a/247181)

htop features
Drill down
• Namespace -> Tables
• Table -> Regions
• RegionServer -> Regions
• Select the record (Namespace, Table or RegionServer) you want to drill down into and press the i key
• Demo (https://asciinema.org/a/247182)

htop internals
• htop gets its metrics from ClusterMetrics, which is returned by Admin.getClusterMetrics()
  • It needs to access only the HBase Master
  • If we add more metrics, we first need to add them to ClusterMetrics
• JMX endpoints would provide more metrics, but using them requires accessing all RegionServers, which might cause scalability issues

Current status of htop
• Not committed yet; still a work in progress
• Building htop for HBase 2.x
  • The basic features have been implemented
• The remaining tasks for htop
  • Some code refactoring
  • Adding some tests
  • Documentation

htop in the future
• Support branch-1
• Add more metrics so that we can see more information from htop
  • Response time metrics ASAP
  • The metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.)
  • System information like CPU usage and memory usage might be useful
• Useful features in Unix top command
  • Color mapping
  • Batch mode, etc.
THANK YOU
Q & A

Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 
Open Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart CitiesOpen Source, Open Data: Driving Innovation in Smart Cities
Open Source, Open Data: Driving Innovation in Smart Cities
 
Big Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science InstituteBig Data Technologies in Support of a Medical School Data Science Institute
Big Data Technologies in Support of a Medical School Data Science Institute
 
Hadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native EraHadoop Storage in the Cloud Native Era
Hadoop Storage in the Cloud Native Era
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 

Supporting Apache HBase : Troubleshooting and Supportability Improvements

  • 1. Supporting Apache HBase: Troubleshooting and Supportability Improvements
  • 2. Who we are (© Cloudera, Inc. All rights reserved.)
      • Daisuke Kobayashi (d1ce_)
          • Customer support at Cloudera since 2012, focusing on HDFS and HBase
          • Apache HBase contributor
      • Toshihiro Suzuki (brfrn169)
          • Apache HBase committer since 2018
          • Sr. Software Engineer, Breakfix (HBase/Phoenix, HDFS) at Cloudera
          • Wrote and published a book on HBase for beginners, in Japanese
  • 3. Supporting HBase
      • Typical troubleshooting scenarios with HBase
          • Fix performance degradation (slowness)
          • Identify why a process crashed
          • Fix inconsistencies
  • 4. Agenda
      • General approach to HBase performance issues with existing tools
      • htop - Real-time monitoring tool for HBase
  • 5. General approach to HBase performance issues with existing tools (logs and metrics are strictly aligned to HBase 2.1 (CDH 6.2))
  • 6. Troubleshooting Performance Issues
      • Performance issues are tough!
      • Typical reasons
          • “Hot spot” region
          • Region with non-local data
          • Excessive I/O wait due to swapping or an overworked or failing hard disk
          • Stop-the-world events caused by long GC pauses in RegionServers
          • Slowness due to high processor usage
          • Network saturation, etc.
      • Source of truth
          • Logs (a lot!)
          • Metrics (a lot!)
  • 7. Approach to Performance Troubleshooting
      • Understanding the issue
      • Top-down
      • USE Method (specifically, focusing on U and S in this talk)
      • Source: https://www.slideshare.net/brendangregg/velocity-2015-linux-perf-tools
  • 8. Resources and Observability in RegionServer (diagram: MemStore, BlockCache, RPC System (Handlers / Queues), HDFS Client)
  • 9. Resources and Observability in RegionServer (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache)
  • 10. Resources and Observability in RegionServer (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache, annotated with what to observe)
      • Cache size
      • Cache eviction ratio
      • Flush size
      • Frequency of requests
      • Memstore size
      • Frequency of flush
      • RPC processed time, queue length & time
      • Flush queue
      • Frequency of blocking updates
  • 11. (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache)
  • 12. RPC System Utilization & Saturation
      • Number of RPC requests
          • Incremented by one by the following actions at the RPC server level:
            doReplayBatchOp, closeRegion, compactRegion, flushRegion, getOnlineRegion, getRegionInfo, getServerInfo, openRegion, rollWALWriter, bulkLoadHFile, prepareBulkLoad, get, multi, mutate, scan
      • Master webui: HBASE-21207 made the columns sortable!
      • Raw metrics:
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
          "totalRequestCount" : 167130,
  • 13. RPC System Utilization & Saturation
      • RPC queue length & request size (RegionServer webui)
      • Raw metrics:
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
          "queueSize" : 619211,
          "numCallsInGeneralQueue" : 5,
          "numCallsInPriorityQueue" : 0,
      • queueSize: running count of the size in bytes of all outstanding calls, whether currently executing or queued waiting to be run.
      • numCallsInGeneralQueue: queue for normal handlers. The number of handlers is controlled by hbase.regionserver.handler.count
      • numCallsInPriorityQueue: queue for high-priority handlers that deal with admin requests and system table operation requests. The number of handlers is controlled by hbase.regionserver.metahandler.count
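The raw metrics on slide 13 can be read programmatically. The following is a minimal sketch, assuming the payload has the `{"beans": [...]}` shape served by the RegionServer's `/jmx` endpoint; the helper name `find_bean` and the embedded sample string are illustrative, and the values are the ones shown on the slide.

```python
import json

# Abbreviated payload in the assumed /jmx shape; values copied from the slide.
SAMPLE = """
{"beans": [
  {"name": "Hadoop:service=HBase,name=RegionServer,sub=IPC",
   "queueSize": 619211,
   "numCallsInGeneralQueue": 5,
   "numCallsInPriorityQueue": 0}
]}
"""


def find_bean(payload: str, bean_name: str) -> dict:
    """Return the first bean whose "name" attribute equals bean_name."""
    for bean in json.loads(payload).get("beans", []):
        if bean.get("name") == bean_name:
            return bean
    raise KeyError(bean_name)


ipc = find_bean(SAMPLE, "Hadoop:service=HBase,name=RegionServer,sub=IPC")
# Calls waiting in either queue; queueSize is bytes, not a call count.
queued_calls = ipc["numCallsInGeneralQueue"] + ipc["numCallsInPriorityQueue"]
print(f"outstanding bytes={ipc['queueSize']}, queued calls={queued_calls}")
```

In practice the payload would come from an HTTP request to the RegionServer info port rather than from an embedded string.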
  • 14. RPC System Utilization & Saturation
      • Number of processed/queued requests
      • If queued > processed, time to check thread dump
      • Raw metrics:
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=IPC",
          "ProcessCallTime_num_ops" : 10961,
          "QueueCallTime_num_ops" : 10961,
      • Cloudera Manager chart: select ipc_process_rate, ipc_queue_rate where roleType = REGIONSERVER
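The "queued > processed" rule on slide 14 compares rates, so it needs two samples of the counters. A rough sketch, assuming the counters are sampled at known times; the `IpcSample` type and the second sample's values are hypothetical, and only the first sample comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class IpcSample:
    timestamp_s: float
    process_ops: int  # ProcessCallTime_num_ops
    queue_ops: int    # QueueCallTime_num_ops


def rates(prev: IpcSample, curr: IpcSample) -> tuple:
    """Return (processed per second, queued per second) between two samples."""
    elapsed = curr.timestamp_s - prev.timestamp_s
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    process_rate = (curr.process_ops - prev.process_ops) / elapsed
    queue_rate = (curr.queue_ops - prev.queue_ops) / elapsed
    return process_rate, queue_rate


prev = IpcSample(timestamp_s=0.0, process_ops=10961, queue_ops=10961)
curr = IpcSample(timestamp_s=10.0, process_ops=11461, queue_ops=11961)  # hypothetical
process_rate, queue_rate = rates(prev, curr)
# Calls are leaving the queue faster than handlers finish them.
needs_thread_dump = queue_rate > process_rate
print(process_rate, queue_rate, needs_thread_dump)
```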
  • 15. RPC System Utilization & Saturation
      • Observability improvements
          • In the past, when a scan.next() call was slow, the target region name was unknown.
          • HBASE-16972 improved the logging by adding ‘scandetails’.
      • Log without scandetails:
          2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer"}
      • Log with scandetails:
          2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: (responseTooSlow): {"call":"Scan(org.apache.hadoop.hbase.shaded.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1553110361981,"responsesize":63,"method":"Scan","param":"scanner_id: 2068237026033076679 number_of_rows: 100 close_scanner: false next_call_seq: 2 client_handles_partials: true client_handles_heartbeats: tru<TRUNCATED>","processingtimems":30000,"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer","scandetails":"table: cluster_test region: cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}
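Because the responseTooSlow payload on slide 15 is JSON, the slow region can be pulled out mechanically. A minimal sketch, assuming the payload follows the `(responseTooSlow): ` marker as in the slide; the sample line is abbreviated from the slide (several fields dropped) and `parse_slow_response` is an illustrative helper name.

```python
import json
import re

# Abbreviated from the slide: only a few of the JSON fields are kept.
LINE = (
    '2019-03-20 19:33:11,982 WARN org.apache.hadoop.hbase.ipc.RpcServer: '
    '(responseTooSlow): {"method":"Scan","processingtimems":30000,'
    '"client":"10.1.1.6:34690","queuetimems":0,"class":"HRegionServer",'
    '"scandetails":"table: cluster_test region: '
    'cluster_test,19999998,1557654024101.db9b3c6211849f53e8857e55279b8d12."}'
)

_SCANDETAILS_RE = re.compile(r"table: (\S+) region: (\S+)")


def parse_slow_response(line: str):
    """Return timing and scan target for a responseTooSlow line, else None."""
    _, marker, payload = line.partition("(responseTooSlow): ")
    if not marker:
        return None
    record = json.loads(payload)
    match = _SCANDETAILS_RE.search(record.get("scandetails", ""))
    return {
        "method": record.get("method"),
        "processing_ms": record.get("processingtimems"),
        "queue_ms": record.get("queuetimems"),
        "table": match.group(1) if match else None,
        "region": match.group(2) if match else None,
    }


info = parse_slow_response(LINE)
print(info["table"], info["region"], info["processing_ms"])
```

Lines written before HBASE-16972 carry no `scandetails` key, so `table` and `region` come back as `None` for them.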
  • 16. (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache)
  • 17. Memstore Utilization & Saturation
      • Memstore size (RegionServer webui)
      • Raw metrics:
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
          "memStoreSize" : 5372418924,
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=Regions",
          "Namespace_default_table_cluster_test_region_7cdc92fd59a4f1a96b431552d952560c_metric_memStoreSize" : 18295903,
          "Namespace_default_table_dice2_region_155bf45f338288ae19cc0e3841a5d013_metric_memStoreSize" : 0,
          "Namespace_default_table_cluster_test_region_d5349e089ff8129faa1e35dee2957e27_metric_memStoreSize" : 4642160,
  • 18. Memstore Utilization & Saturation
      • Compare the total memstore size across RegionServers
          • Cloudera Manager chart: select total_memstore_size_across_hregions where roleType = REGIONSERVER
      • Compare memstore size across regions in a RegionServer
          • Cloudera Manager chart: select memstore_size where category = HREGION
  • 19. Memstore Utilization & Saturation
      • Log snippet where a flush finishes
          2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, sequenceid=4255, compaction requested=true
          • dataSize: the cells' data alone (key bytes and value bytes) that is going to be flushed. This can be allocated off-heap too.
          • heapSize: the cells' data on-heap along with its metadata and index (overhead of Java objects)
          • currentSize: the cells' data alone on-heap after the flush
          • 3db6134cedc326474801068c3cb4f2a9: encoded region name
          • in 1625ms: how long the flush took to complete
      • Frequency of flush (per hour)
          # grep "Finished flush of" <rs_log> | grep -o "^2019-..-.. .." | uniq -c
               81 2019-05-13 17
                6 2019-05-13 18
              113 2019-05-15 02
               18 2019-05-15 04
               27 2019-05-15 12
              133 2019-05-15 19
                5 2019-05-15 20
              198 2019-05-15 22
               91 2019-05-15 23
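The grep pipeline on slide 19 can be sketched in Python, which also makes it easy to pull out the flush duration. This is an illustration only: the first sample line is the one from the slide, while the other two are made-up lines in the same format, and the function names are hypothetical. Note that `uniq -c` counts adjacent duplicates whereas `Counter` counts all of them; for a time-ordered log the results agree.

```python
import re
from collections import Counter

_FLUSH_RE = re.compile(
    r"Finished flush of .* for (?P<region>[0-9a-f]{32}) in (?P<ms>\d+)ms"
)

SAMPLE_LOG = [
    # From the slide.
    "2019-04-13 01:28:56,376 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~105.70 MB/110836931, heapSize ~105.85 MB/110989816, "
    "currentSize=2.94 MB/3084019 for 3db6134cedc326474801068c3cb4f2a9 in 1625ms, "
    "sequenceid=4255, compaction requested=true",
    # Made-up lines in the same format, for illustration only.
    "2019-04-13 01:40:02,114 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~10.00 MB/10485760, heapSize ~10.10 MB/10590617, "
    "currentSize=0 B/0 for 7cdc92fd59a4f1a96b431552d952560c in 210ms, "
    "sequenceid=4300, compaction requested=false",
    "2019-04-13 02:05:40,007 INFO org.apache.hadoop.hbase.regionserver.HRegion: "
    "Finished flush of dataSize ~20.00 MB/20971520, heapSize ~20.20 MB/21181235, "
    "currentSize=0 B/0 for 7cdc92fd59a4f1a96b431552d952560c in 480ms, "
    "sequenceid=4377, compaction requested=false",
]


def flushes_per_hour(lines) -> Counter:
    """Count finished flushes keyed by 'YYYY-MM-DD HH' (first 13 characters)."""
    counts = Counter()
    for line in lines:
        if "Finished flush of" in line:
            counts[line[:13]] += 1
    return counts


def slowest_flush(lines):
    """Return (encoded region name, milliseconds) of the longest flush, or None."""
    found = [
        (m.group("region"), int(m.group("ms")))
        for m in map(_FLUSH_RE.search, lines)
        if m is not None
    ]
    return max(found, key=lambda pair: pair[1]) if found else None


per_hour = flushes_per_hour(SAMPLE_LOG)
slowest = slowest_flush(SAMPLE_LOG)
print(dict(per_hour), slowest)
```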
  • 20. Memstore Utilization & Saturation
      • Indication of blocked updates due to high memstore utilization
          • Global memstore > hbase.regionserver.global.memstore.size
          • A memstore > hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size
      • RegionServer log:
          • Why were updates blocked?
            2019-05-13 17:12:08,001 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Blocking updates: global memstore heapsize 403.0 M is >= blocking 403.0 M
          • How long was it blocked?
            2019-05-13 17:12:10,809 WARN org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Memstore is above high water mark and block 2808ms
          • Blocking updates finished
            2019-05-13 17:12:10,809 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Unblocking updates for server host-10-17-101-197.coe.cloudera.com,22101,1557773899580
      • Client log:
          19/05/20 07:39:22 INFO client.RpcRetryingCallerImpl: Call exception, tries=7, retries=11, started=8164 ms ago, cancelled=false, msg=org.apache.hadoop.hbase.RegionTooBusyException: Over memstore limit=128.0M, regionName=d5860b5e1a35025b6aab68dff4d944aa, server=host-10-17-101-198.coe.cloudera.com,22101,1558363100074
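The two blocking conditions on slide 20 are simple arithmetic over configuration values. A small sketch under stated assumptions: `hbase.regionserver.global.memstore.size` is treated as a fraction of the maximum heap, and the heap size, flush size and multiplier below are hypothetical example values rather than the settings of the cluster in the logs.

```python
MIB = 1024 * 1024


def region_block_limit(flush_size_bytes: int, block_multiplier: int) -> int:
    """hbase.hregion.memstore.block.multiplier * hbase.hregion.memstore.flush.size"""
    return flush_size_bytes * block_multiplier


def global_block_limit(max_heap_bytes: int, global_memstore_fraction: float) -> int:
    """Assumes hbase.regionserver.global.memstore.size is a fraction of max heap."""
    return int(max_heap_bytes * global_memstore_fraction)


def is_blocked(region_memstore: int, global_memstore: int,
               region_limit: int, global_limit: int) -> bool:
    # ">" for a single memstore follows the slide; ">=" for the global check
    # follows the wording of the RegionServer log message.
    return region_memstore > region_limit or global_memstore >= global_limit


# Hypothetical settings: 128 MiB flush size, multiplier 4, 4 GiB heap, 0.4 fraction.
region_limit = region_block_limit(128 * MIB, 4)
global_limit = global_block_limit(4096 * MIB, 0.4)
print(region_limit // MIB, "MiB per region;", global_limit, "bytes globally")
```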
  • 21. (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache)
  • 22. Blockcache Utilization & Saturation
      • Current block cache usage (RegionServer webui)
          • Raw metrics:
              "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
              "blockCacheSize" : 406847872,
              "blockCacheFreeSize" : 6291459,
      • Cache eviction
          • Raw metrics:
              "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
              "blockCacheEvictionCount" : 38257,
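The block cache metrics on slide 22 can be turned into a utilization ratio and an eviction rate. This is a sketch under the assumption that the cache capacity is roughly `blockCacheSize + blockCacheFreeSize`; the second eviction sample and the 60-second interval are hypothetical.

```python
def block_cache_utilization(used_bytes: int, free_bytes: int) -> float:
    """Fraction of the cache in use, assuming capacity = used + free."""
    total = used_bytes + free_bytes
    return used_bytes / total if total else 0.0


def eviction_rate(prev_count: int, curr_count: int, elapsed_s: float) -> float:
    """Evictions per second between two samples of blockCacheEvictionCount."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (curr_count - prev_count) / elapsed_s


# blockCacheSize and blockCacheFreeSize from the slide.
utilization = block_cache_utilization(406847872, 6291459)
# First count from the slide; the second count and interval are made up.
evictions_per_s = eviction_rate(38257, 38857, 60.0)
print(f"utilization={utilization:.1%}, evictions/s={evictions_per_s}")
```

A cache that stays nearly full while the eviction rate keeps climbing suggests the working set no longer fits.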
  • 23. Blockcache Utilization & Saturation
      • Compare the free size across RegionServers
          • Cloudera Manager chart: select block_cache_free_size where roleType = REGIONSERVER
      • Compare the evicted blocks ratio across RegionServers
          • Cloudera Manager chart: select block_cache_evicted_rate where roleType = REGIONSERVER
  • 24. (diagram: RPC System (Handlers / Queues), HDFS Client, MemStore, BlockCache)
  • 25. HDFS Client Utilization & Saturation
      • Flush queue size (RegionServer webui)
      • Raw metrics:
          "name" : "Hadoop:service=HBase,name=RegionServer,sub=Server",
          "flushQueueLength" : 0,
      • Cloudera Manager chart: select flush_queue_size where roleType = REGIONSERVER
  • 26. htop – Real-Time Monitoring Tool for HBase
  • 27. htop overview
      • HBASE-11062 htop
      • Work in progress!
      • Unix top-like tool
      • Real-time monitoring for HBase metrics
  • 28. htop motivation
      • HBase UIs
          • The metrics of the moment
          • Can't see the metrics in time series
      • Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics (via Grafana)
          • The metrics in time series
          • Collecting the latest metrics takes a little time
      • htop
          • Real-time monitoring
          • A lot of features for real-time monitoring
  • 29. htop motivation
                                  HBase UI   Ganglia/OpenTSDB/Cloudera Manager/Ambari Metrics   htop
      Metrics of the Moment       ○          △                                                  ○
      Metrics in Time Series      ☓          ○                                                  ☓
      Real-Time Monitoring        △          △                                                  ○
  • 30. htop features: htop screen
      • Command to start htop:
          • $ hbase top
      • Similar to the Unix top command
      • The metrics are refreshed at a fixed interval (3 seconds by default)
      • Vertical and horizontal scrolling
  • 31. htop features: htop screen
      • Demo (https://asciinema.org/a/247434)
  • 32. htop features: Change refresh delay
      • Press the d key and enter a new refresh delay
      • We can also change the default refresh delay by specifying a command line argument:
          • ex) $ hbase top -delay 2 # the default refresh delay is 2 seconds
  • 33. htop features: Change refresh delay
      • Demo (https://asciinema.org/a/247447)
  • 34. htop features: Metrics per Namespace/Table/RegionServer/Region
      • Press the m key and choose a mode
          • Namespace mode: metrics per Namespace
          • Table mode: metrics per Table
          • RegionServer mode: metrics per RegionServer
          • Region mode (default): metrics per Region
      • We can also change the default mode by specifying a command line argument:
          • ex) $ hbase top -mode n # the default mode is Namespace mode
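One way to picture the four modes on slide 34 is as the same per-region records grouped by a different key. The sketch below illustrates that idea only and is not htop's actual implementation; the record layout, the `req_per_s` field, the server names and all values are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-region records; names and values are made up.
REGIONS = [
    {"namespace": "default", "table": "cluster_test", "regionserver": "rs1",
     "region": "7cdc92fd59a4f1a96b431552d952560c", "req_per_s": 120.0},
    {"namespace": "default", "table": "cluster_test", "regionserver": "rs2",
     "region": "d5349e089ff8129faa1e35dee2957e27", "req_per_s": 30.0},
    {"namespace": "default", "table": "dice2", "regionserver": "rs1",
     "region": "155bf45f338288ae19cc0e3841a5d013", "req_per_s": 5.0},
]

# Which record fields form the grouping key in each mode.
MODE_KEYS = {
    "namespace": ("namespace",),
    "table": ("namespace", "table"),
    "regionserver": ("regionserver",),
    "region": ("region",),
}


def summarize(records, mode: str) -> dict:
    """Sum req_per_s over all regions that share the mode's grouping key."""
    totals = defaultdict(float)
    for record in records:
        key = tuple(record[field] for field in MODE_KEYS[mode])
        totals[key] += record["req_per_s"]
    return dict(totals)


by_table = summarize(REGIONS, "table")
by_server = summarize(REGIONS, "regionserver")
print(by_table)
print(by_server)
```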
  • 35. htop features: Metrics per Namespace/Table/RegionServer/Region
      • Demo (https://asciinema.org/a/247177)
  • 36. htop features: Choose displayed fields and change the order of fields
      • Press the f key and choose the displayed fields (by pressing the space key)
      • We can also change the order of the fields in the same screen
          • The Right key selects a field to move, then the Left key or the Enter key commits
  • 37. htop features: Choose displayed fields and change the order of fields
      • Demo (https://asciinema.org/a/247306)
  • 38. htop features: Sort the metrics by the field values
      • Press the f key and choose a sort field (by pressing the s key)
      • Switch between descending and ascending order by pressing the R key
      • Demo (https://asciinema.org/a/247180)
  • 39. htop features: Filter with the field values
      • ex) NAMESPACE==default, REQ/S>1000
      • Operators: = (only needs a partial match), == (needs an exact match), >, >=, <, <=, !
      • o key: add a case-insensitive filter
      • O key: add a case-sensitive filter
      • ctrl + o key: show current filters
      • = key: clear current filters
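The filter expressions on slide 39 can be modelled as small predicates over a record. This is an illustrative sketch, not htop's parser: the record layout is hypothetical, and treating `!` as a leading negation (as in the Unix top command) is an assumption, since the slide only lists the operator.

```python
import re
from typing import Callable, Dict, Union

Value = Union[str, float]

# Optional "!" prefix, field name, operator, value. Two-character operators
# are listed first so that ">=" is not read as ">".
_FILTER_RE = re.compile(r"^(!?)\s*([^=<>!]+?)\s*(==|>=|<=|=|>|<)\s*(.+)$")


def parse_filter(text: str, ignore_case: bool = True) -> Callable[[Dict[str, Value]], bool]:
    """Turn an expression such as 'REQ/S>1000' into a predicate on a record."""
    match = _FILTER_RE.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse filter: {text!r}")
    negate, field, op, raw = match.groups()

    def norm(value) -> str:
        s = str(value)
        return s.lower() if ignore_case else s

    def predicate(record: Dict[str, Value]) -> bool:
        actual = record[field]
        if op == "=":        # partial match
            result = norm(raw) in norm(actual)
        elif op == "==":     # exact match
            result = norm(raw) == norm(actual)
        else:                # numeric comparison
            left, right = float(actual), float(raw)
            result = {">": left > right, ">=": left >= right,
                      "<": left < right, "<=": left <= right}[op]
        return result != bool(negate)

    return predicate


# Hypothetical records.
RECORDS = [
    {"NAMESPACE": "default", "TABLE": "cluster_test", "REQ/S": 1500.0},
    {"NAMESPACE": "hbase", "TABLE": "meta", "REQ/S": 10.0},
]

filters = [parse_filter("NAMESPACE==default"), parse_filter("REQ/S>1000")]
matched = [r for r in RECORDS if all(f(r) for f in filters)]
print([r["TABLE"] for r in matched])
```

Passing `ignore_case=False` corresponds to the case-sensitive filter added with the O key.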
  • 40. htop features: Filter with the field values
      • Demo (https://asciinema.org/a/247181)
  • 41. htop features: Drill down
      • Namespace -> Tables
      • Table -> Regions
      • RegionServer -> Regions
      • Select the record (Namespace, Table or RegionServer) you want to drill down into and press the i key
  • 42. htop features: Drill down
      • Demo (https://asciinema.org/a/247182)
  • 43. htop internals
      • htop gets the metrics from ClusterMetrics, obtained via Admin.getClusterMetrics()
          • It needs to access only the HBase Master
          • To add more metrics, we first need to add them to ClusterMetrics
      • The JMX endpoints would provide more metrics, but reading them requires access to all RegionServers, which might cause scalability issues
  • 44. Current status of htop
      • Not committed yet and a work in progress
      • Building htop for HBase 2.x
      • The basic features have been implemented
      • The remaining tasks for htop
          • Some code refactoring
          • Adding some tests
          • Documentation
  • 45. htop in the future
      • Support branch-1
      • Add more metrics so that we can see more information from htop
          • Response time metrics ASAP
          • The metrics per Column Family/User/Operation (GET, PUT, SCAN, etc.)
          • System information like CPU usage and memory usage might be useful
      • Useful features in the Unix top command
          • Color mapping
          • Batch mode, etc.
  • 47. Q & A

Editor's Notes

  1. First of all, let us introduce ourselves. My name is Daisuke Kobayashi. My teammates call me Dice, or DiceK, as a nickname. I have been working at Cloudera, based in Japan, since 2012. I currently work in backline support, helping customers and internal support folks resolve complicated issues. I’m also an HBase contributor. Hello, my name is Toshihiro Suzuki. I have been an HBase committer since last year. I’m a Sr. Software Engineer, Breakfix, in the Support team at Cloudera. I mainly handle HBase/Phoenix and HDFS cases. I have written and published a book on HBase for beginners in Japanese.
  2. So what does supporting HBase mean at Cloudera? At Cloudera, we have a big HBase user base, and cluster sizes vary widely, from 10 nodes to 100 and 1000 nodes. Customers report various types of issues to our support team every single day, and our job is simple: just fix the issue and answer their questions. If I summarize the problems reported by customers, these are the typical scenarios we usually see: fixing performance degradation, identifying why a process crashed, and fixing inconsistencies, which is a well-known issue in both HBase 1 and 2. In this talk, we will specifically focus on the first one.
  3. From my side, I’m gonna introduce the general approach to performance issues and show the existing tools we usually use in the context of HBase troubleshooting. Later on, my colleague Toshi will talk about a new tool he’s now developing. It’s more intuitive and efficient for troubleshooting in real time.
  4. So, fixing performance issues is tough. The number of nodes differs across customers, and they run different versions with different configurations, different types of datasets and different use cases. They are all different. Various factors can lead to performance issues: misconfigurations in HBase, or unbalanced load on regionservers, also known as a hot spot, caused by bad schema design. Also, all regionservers should be collocated with datanodes, and if a particular region’s block doesn’t exist on the local datanode, the regionserver has to read the data remotely from other datanodes. Apart from that, there might be bad OS configuration, GC issues, hardware failures or network-related issues. Another thing that makes these issues difficult to troubleshoot is that a lot of information about how the HBase cluster performs is exposed through logs and metrics. Whenever we analyze problems, we have to pick the right log snippets and metrics and correlate them to the root cause. To take advantage of the logs and metrics, we obviously need to understand what they actually mean, why they are logged, and when a particular metric is incremented. It's also important to understand what they are not. For core HBase developers, these questions may be easy to answer, but HBase is widespread and used by many users across various industries. Over the last couple of years, I have been asked about the meaning of particular metrics and log snippets over and over. So the aim of my talk is to share this basic information to help others narrow down problems and dig in further.
  5. So, to start performance troubleshooting, I think these are the typical and important approaches. First off, we need to listen to customers to understand what they are complaining about, what they are hitting, and what they want to resolve. This is the very first and most important step to be on the same page with them. To narrow down performance issues, in general we should look at the system with a top-down approach. Specifically in HBase, we first look at the cluster itself and see how resource usage is distributed across nodes. If something looks wrong on a particular node, we need to dig into that node. Throughout the troubleshooting steps, I like using the USE method, which was originally defined by Brendan Gregg at Netflix, an ex-Sun guy. The USE method is designed like an emergency checklist in a flight manual, so it’s intended to be simple, straightforward, complete, and fast. USE stands for Utilization, Saturation, and Errors. Utilization asks how busy a particular resource is. Saturation can be measured as the length of a wait queue, or the time spent waiting in the queue. Errors are explicit indications of something going wrong. The USE method is obviously not perfect, but it can be used as the very first checklist to identify the bottleneck as quickly as possible. So, the next question is: what are the resources in HBase? As you know, the RegionServer is the worker role and is responsible for processing read and write requests.
  6. These are the typical resources in a single regionserver.
  7. All user requests come into the RPC system first, where they are queued and processed by handlers concurrently. For caching, a request goes to the memstore for writes or to the block cache for reads. The data is persisted to HDFS under certain conditions. As you know, requests always flow in the direction of the orange arrow, which means we should always follow this path when checking resources.
  8. So what types of information are exposed by each resource? For example, the RPC system exposes the number of requests and how many requests are queued and processed. The memstore exposes the memstore size, the size of flushed memstores, and the frequency of flushes. Using these observability items, we can check how each resource is utilized and saturated. Over the next slides, let’s walk through each resource one by one.
  9. First, the RPC system
  10. From this slide on, I’m gonna show you the metrics, web UI, and logs used for troubleshooting. Please note that all of these are aligned to the HBase 2.1 code base, more specifically CDH 6.2. As I mentioned, the RPC system is the place where all client requests arrive, so we should be able to check how many requests are received by every single regionserver. Here in the gray area, I’m showing the raw metric that is exposed via the JMX endpoint on a particular regionserver. The total request count is also exposed through the Master and regionserver web UIs. We can simply compare the requests across regionservers. If one value stands out, that’s a chance to narrow down to that particular regionserver. If you have been managing HBase and are familiar with these web UIs, you may be aware that the columns in the table are sortable. This is a simple but powerful change. We often have a screen-sharing session with a customer to look at the issue in real time. Previously, every time we looked at these web UIs, it was difficult to figure out the highest or lowest servers without doing something tricky, so this sorting functionality should make our life easier. This number is incremented by various types of request calls at the RPC server level, as described in the slide.
  11. Next, to understand saturation, the number of requests queued at a particular point in time is exposed. That is what I’m showing in the gray area as raw metrics, with the corresponding values in the web UI below. As the meta table is usually accessed more frequently than others, its queue is isolated from the queue for normal regions. If the queue size is constantly growing, it may indicate that something is going wrong in processing the requests.
  12. We can check how many requests have been processed and queued so far by the RPC system. I’m showing the raw metric values in the gray area. Since each is just a cumulative counter, Cloudera Manager converts the value into a rate, which makes it easy to understand how things are going over time. Ideally, the processed and queued rates should be the same. Processed is the blue graph and queued is the green one in this example. We can see that both match exactly, since things are going well. If queued becomes bigger than processed, it’s a sign that the RPC handlers are getting slow for some reason. We should check a thread dump to dig in further.
  13. If the RPC system takes longer than 10 seconds to respond to a given request, it logs the table and the region name in the process logs. However, when a scan next call was slow, neither the target region name nor the row key was logged, so we were really frustrated while troubleshooting. Fortunately, recent versions improve this by logging the scan details, as I'm showing with the green marker in the second example. With this hint, we should be able to narrow down to the particular region to see why it’s slow.
  14. Alright, next let’s take a look at memstore.
  15. Memstore utilization is exposed at several levels: server, table, and region. Here I'm showing the server-level and region-level raw metrics along with the corresponding web UI. I think it’s fairly easy to understand the memstore utilization.
  16. When using Cloudera Manager, we typically use this sort of query to compare the total memstore utilization across regionservers. The upper graph shows this. We can also check whether any region stands out by using more memstore than the other regions in a single regionserver, which is shown in the lower graph.
  17. A flush persists data in the memstore to the underlying HDFS, which means the memstore is fully utilized, or most likely saturated. This is an example of a log snippet written when a flush finishes. In HBase 2, data can be allocated off-heap for both reads and writes. Given this, the log reports the pure key-value data size and the on-heap occupancy separately. It also shows how long the flush took. These numbers are informative for seeing how a particular flush went. If it takes long, it may be time to look at HDFS performance too. Using this granular flush logging, we can see the frequency of flush activity on a regionserver. In this example, I'm grouping the output on an hourly basis.
  18. If the total memstore size across regions in a single regionserver goes beyond the global memstore size limit, all updates are blocked by the regionserver until the utilization drops below the threshold. This is a typical set of log messages in HBase 2.1. There are three related lines. The first line indicates that blocking of updates started because the global memstore size became greater than the blocking threshold. The second line shows how long it took, and the third line indicates that blocking completed. In the second example, the client gets a RegionTooBusyException for the particular region. This is because the region’s memstore has grown too big without being flushed yet. This is also a typical indication of saturation of a specific memstore.
  19. In the context of the block cache, utilization is simply the cache usage, which is available via raw metrics and also via the web UI. If blocks are evicted from the cache, in general it means the cache is saturated. I’m showing the raw metrics on the left-hand side and the corresponding web UI information on the right-hand side. From the top, they indicate how much of the block cache is used, how much memory remains for the cache, and the number of evicted blocks.
  20. Using Cloudera Manager, we can check the eviction rate, which is converted from the raw metric value. I’m showing an example in the graph below. If the utilization is high and the eviction rate is also high, it’s a sign that the block cache is too small to handle the current workload appropriately. So it's time to think about increasing the cache size.
  21. Alright, I’m gonna quickly cover the last resource in the picture. HDFS resource utilization and saturation are basically tracked by HDFS-level metrics and logs, so I can't say much about them in this session, but I am gonna show one related metric exposed at the HBase level.
  22. That’s the flush queue size. When flushing a memstore, the flush request is queued first and the data is persisted to HDFS later. The queue is maintained at the regionserver level and exposed as a metric through the web UI. It’s visible in Cloudera Manager charts as well. Typically the queue shouldn’t grow, so if it is constantly growing, it denotes that flushes are failing or getting slow for some reason. So it's time to look at the HDFS side. That’s pretty much all I have prepared for this presentation. Alright, I have been talking about how to look at the resources in HBase and their utilization and saturation, mainly from metrics and sometimes from logs. I’m pretty sure that I couldn’t cover everything. If we can’t find anything bad with this approach, we have to look further using a different approach, but I hope you found some ideas in my talk. Next, Toshi is gonna give a presentation about a new tool which should make our life better.
  23. From my side, I’m going to talk about htop, a real-time monitoring tool for HBase.
  24. So, an overview of htop. htop is the tool I’m developing now, which was proposed in the JIRA ticket HBASE-11062. It is a Unix top-like tool, and we can do real-time monitoring of HBase metrics with it.
  25. And, the motivation for htop. As Dice mentioned, the first approach when we face performance issues is to check the current status of the cluster. For this, we can look at the HBase UIs to check the metrics. They show the metrics of the moment, but we can't see them as time series there. If you want to see the metrics as time series, we have Ganglia, OpenTSDB, Cloudera Manager and Ambari Metrics. With Ambari Metrics, we can see the metrics via Grafana. They are useful when we want to see the metrics as time series, but they are not very useful for real-time monitoring, because collecting the latest metrics takes a little time in those tools. For real-time monitoring, I have started to develop htop. I’ll explain the features of htop later in this talk.
  26. To clarify the position of htop, I made this matrix of the features of those tools. If you just want to see the metrics of the moment, you can use any of these tools. However, in Ganglia, OpenTSDB, Cloudera Manager and Ambari Metrics, collecting the latest metrics takes a little time. If you want to see the metrics as time series, you need to use Ganglia, OpenTSDB, Cloudera Manager or Ambari Metrics. And if you want to do real-time monitoring, htop is the most useful of them, as it has a lot of features for that.
  27. From here, I will talk about the features of htop with demonstrations. First, the htop screen. We can start htop by running the hbase top command. The UI is similar to the Unix top command. The metrics are refreshed periodically, every 3 seconds by default, and you can scroll vertically and horizontally.
  28. I’ll show you a demo of the htop screen. Actually, this is not a live demo, but a terminal recording, and you can watch it anytime at this URL. To start htop, run the hbase top command. This is the htop screen. The metrics on this screen are refreshed every 3 seconds. It consists of 2 parts, the Summary part and the Metrics part. In the Summary part, you can see the HBase version, cluster ID, the number of region servers, the region count, Average Cluster Load and the aggregated Request count per second. In the Metrics part, you can see the metrics. In this case, you can see the metrics per region, and it shows the namespace name, table name, encoded region name, RegionServer name, Request count per second, read request count per second and so on. You can scroll down to see all the metrics like this. You can also scroll horizontally like this.
  29. As mentioned, the refresh delay is 3 seconds by default, but you can change it by pressing the ‘d’ key and entering a new refresh delay. We can also change the default refresh delay by specifying the command line argument “-delay”.
  30. I’ll show you a demo of it. If you press the ‘d’ key on the htop screen, you can enter a new refresh delay. In this demo, I’m changing it to 1 second. Yeah, it has been changed.
  31. And next. Currently, htop can show the metrics per Namespace, Table, RegionServer and Region. These are called Namespace mode, Table mode, RegionServer mode and Region mode, respectively. The default is Region mode. We can change the mode by pressing the ‘m’ key on the htop screen. We can also change the default mode by specifying the command line argument “-mode”.
  32. So, I’ll show you a demo of it. Now you see the metrics per region, and we can change this to Namespace, Table or RegionServer by pressing the ‘m’ key. For example, we can see the metrics per Namespace like this, or per Table like this.
  33. In addition to that, we can choose which fields are displayed on the screen by pressing the ‘f’ key. We can also change the order of the fields on the same screen.
  34. I’ll show you a demo of it. Pressing the ‘f’ key moves to this screen, where you can choose the displayed fields. For now, in Region mode, these fields can be displayed. For example, if you don’t need the Namespace and Table fields but do need the Region name field, you can remove and add these fields like this. As you can see, the fields are removed and added. We can also change the order of the fields on the same screen. Go back to the field screen by pressing the ‘f’ key, select the field you want to move and press the Right key. Then move the field to wherever you want it and press the Left key. You can see the order of the fields has changed.
  35. It’s also possible to sort the metrics by field values, and we can switch between descending and ascending order by pressing the ‘R’ key. I’ll show you a demo of it. Press the ‘f’ key to move to the previous screen. You can choose a sort field on the same screen. If you want to sort the metrics by “Request count per second,” choose the field and press the ‘s’ key. The current sort field is changed to “Request count per second,” and you can then see the metrics sorted by that field.
  36. So next is the Filter feature, which is very important. For example, if you want to see the metrics of the “default” Namespace only, you can specify the filter NAMESPACE==default. Or if you want to see the metrics that have more than 1000 requests per second, you can specify a filter like REQ/S>1000. In this Filter feature, we can use general operators like the ones shown here. When we press the o key on the htop screen, we can add a case-insensitive filter. When we press the O key, we can add a case-sensitive filter. Also, when we press ctrl + o, we can see the current filters, and when we press the = key, we can clear the current filters.
  37. Let me show you a demo of it. If you want to see the metrics in the “default” namespace only, press the ‘o’ key and specify a filter like this. As you can see, only the metrics in the “default” Namespace are shown now. If you also want to see the metrics of the “test” table only, press the ‘o’ key again and add a filter like this. Now only the metrics in the “default” Namespace and the “test” table are shown. Furthermore, if you want the metrics that have more than 1000 requests, you can add a filter like this. Now we see only the metrics with more than 1000 requests. We can see the specified filters by pressing ctrl + ‘o’ like this. These are the current filters. We can clear the current filters by pressing the ‘=’ key like this. The filters are cleared.
  38. The last feature I’d like to introduce here is the drill-down feature. We can drill down from Namespace to Tables, from Table to Regions, or from RegionServer to Regions. With this feature, we can find a “Hot Spot” region easily. We can drill down by selecting the record we want to drill down into and pressing the i key.
  39. I’ll show you a demo of it. If you want to drill down from the “default” namespace to its tables, move to Namespace mode, select the “default” namespace and press the ‘i’ key. You can then see the metrics for the tables in the “default” namespace. Furthermore, if you want to drill down from the “test” table to its regions, select the “test” table and press the ‘i’ key, and you can see the metrics for the regions of the “test” table. Similarly, you can drill down from a RegionServer to its regions. Move to RegionServer mode, select one of the RegionServers and press the ‘i’ key. You can then see the metrics for the regions on the selected RegionServer. That’s it for the demonstrations of the features of htop.
  40. Next, let me talk about the internals of htop. Currently, htop gets its metrics from the ClusterMetrics class returned by the Admin.getClusterMetrics method, because that requires access only to the HBase Master. So if we want to add more metrics to htop, we first need to add them to the ClusterMetrics class. Actually, the JMX endpoints would give us more metrics, but using them requires access to all RegionServers, which might cause scalability issues. So I decided not to use JMX endpoints for htop.
  41. In this slide, I’ll talk about the current status of htop. As mentioned, htop hasn’t been committed yet, and it’s actually a work in progress. However, the basic features have been implemented, as I showed you in the demonstrations. The remaining tasks are some code refactoring and adding some tests. I also need to write documentation for it. It may be ready for review next month, and once it passes review, it will be committed.
  42. And, htop in the future. Currently, I’m developing this tool for the master branch and branch-2, so as a next step we need to support branch-1. We should also add more metrics so that we can see more information in htop. In particular, adding response time metrics is required because they are very important for performance troubleshooting. We can also add metrics per Column Family, User and Operation, like GET, PUT, SCAN. I’m also thinking about adding system information like CPU usage and memory usage, which might be useful. In addition to that, we can add useful features from the Unix top command, like color mapping or batch mode.
  43. That’s all from my side. We hope this presentation was informative for you. Thank you very much.
  44. We have a few minutes for Q & A. Any Questions?
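
Note 12 describes turning ever-increasing counters into rates and comparing the queued rate against the processed rate. A minimal sketch of that idea follows, assuming counters sampled as (timestamp in seconds, counter value) pairs. The function names and sample numbers are hypothetical, and this is not how Cloudera Manager itself implements the conversion.

```python
def to_rates(samples):
    """Convert (timestamp_sec, counter) samples into (timestamp, per-second rate)."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0 or c1 < c0:
            # Skip out-of-order samples and counter resets (e.g. after a restart).
            continue
        rates.append((t1, (c1 - c0) / elapsed))
    return rates


def handlers_falling_behind(queued, processed, tolerance=0.0):
    """Return timestamps where the queued rate exceeds the processed rate."""
    processed_at = dict(to_rates(processed))
    return [t for t, q in to_rates(queued)
            if t in processed_at and q > processed_at[t] + tolerance]


# Made-up counters sampled every 10 seconds.
queued = [(0, 0), (10, 1000), (20, 2000), (30, 3500)]
processed = [(0, 0), (10, 1000), (20, 2000), (30, 2600)]

suspect_times = handlers_falling_behind(queued, processed)
```

With these made-up samples the two rates match for the first two intervals and diverge in the last one, which is the point at which a thread dump would be worth collecting.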