SlideShare a Scribd company logo
1 of 18
Download to read offline
© Cloudera, Inc. All rights reserved.
HBase Tales From the Trenches
Wellington Chevreuil
© Cloudera, Inc. All rights reserved.
Agenda
• Common types of problems
• Most affected features
• Common reasons
• Case stories review
• General best practices
© Cloudera, Inc. All rights reserved.
Common types of problems
• RegionServers/Master crashes
• Performance
• Regions stuck in transition (RIT)
• FileSystem usage
• Corruption / Data loss / Replication data consistency
© Cloudera, Inc. All rights reserved.
Most affected features or sub-systems
• AssignmentManager
• Replication
• Snapshot
• Region Server Memory Management
• Master Initialisation
• WAL/StoreFile
• Custom applications / filters / co-processors
© Cloudera, Inc. All rights reserved.
Common reasons
• Performance
• Overload/under-dimensioned cluster
• Non optimal configurations
• Crashes
• Memory exhaustion
• File system issues
• Known bugs
• RIT
• Bugs
• Self induced (wrong hbck commands triggered)
© Cloudera, Inc. All rights reserved.
Common reasons
• RIT (cont.)
• Can also happen as side effect of performance, crash or corruption issues
• File system usage exhaustion
• Replication related issues
• Too many snapshots
• Corruption / Data loss / Replication data consistency
• Bugs
• Faulty peers / custom or third party components
• FileSystem problems
© Cloudera, Inc. All rights reserved.
Case Stories Review
© Cloudera, Inc. All rights reserved.
Case story - RegionServers slow/crashing randomly
Type: Process crash | Service outage | Performance
Feature: Region Server Memory Management
Reason: Long GC pauses, due to mismatching heap sizes and workloads
Diagnosing:
• Frequent JvmPauseMonitor alerts on RSes logs
• Occasionally OOME on stdout
• Too many regions per RS (more than 200 regions)
• JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole)
Resolution:
• Initially increase the heap size, but CMS may experience slowness with large heaps
• For heaps larger than 20GB, general G1 recommendations from Cloudera engineering
blog post had provided good results
• Horizontal scaling by adding more RSes
© Cloudera, Inc. All rights reserved.
Case story - Slow scans, compactions delayed
Type: Performance
Feature: Store File
Reason: PrefixTree HFile encoding issues (HBASE-17375)
Diagnosing:
• Compaction queue piling up
• jstacks from RSes show below trace over several frames:
"regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0
tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214)
at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628)
Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding
Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
© Cloudera, Inc. All rights reserved.
Case story - My HBase is slow
Type: Performance
Feature: Client Application Read/Write operations
Reason:
• Dependency services underperforming
• Poor client implementation not reusing connections
• Faulty CPs or custom filters
Diagnosing:
• Client application/RSes jstacks
• General HBase stats such as: compaction queue size, data locality, cache hit ratio
• HDFS/ZK logs
Resolution: Usually require tunings on dependency services or redesign of
client application/custom CPs/Filters
© Cloudera, Inc. All rights reserved.
Case story - Client scans failing | HBCK reports
inconsistencies
Type: RIT
Feature: AssignmentManager
Reason: Various
• Misuse of hbck
• Snapshots cold backups out of hdfs
• Busy/overloaded clusters where regions keep moving constantly
Diagnosing:
• Evident from hbck reports/Master Web UI
• Master logs would show RS opening/hosting region
• RS holding region should have relevant error message logs
Resolution:
• There's no single recipe
• Each case may require a combination of hbck/hbck2 commands
© Cloudera, Inc. All rights reserved.
Case story - Master timing out during initialisation
Type: Master Crash | Service Outage
Feature: Master initialisation
Reason: Different bugs can cause procedures to pile up:
• HBASE-22263, HBASE-16488, HBASE-18109
Diagnosing:
• Listing "/hbase/MasterProcWALs" shows hundreds or more files
• Master times out and crashes before assigning namespace region
ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after
Resolution:
• Stop Master and clean "/hbase/MasterProcWALs" folder
• Caution when on hbase > 2.x releases
Mitigation:
• Increase init timeout, number of open region threads
© Cloudera, Inc. All rights reserved.
Case story - Replication lags
Type: Replication stuck
Feature: Replication
Reason: Single WAL entries with too many OPs, leading to RPCs larger than
"hbase.ipc.server.max.callqueue.size"
Diagnosing: Destination peer RSes showing type of log messages below
2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops,
last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call
queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on
region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
Resolution: requires wal copy and replay from source to destination, plus
manual znode cleanup
Mitigation: releases including HBASE-18027 would prevent this situation
© Cloudera, Inc. All rights reserved.
Case story - Client scan failing on specific regions
Type: HFile Corruption
Feature: Store File
Reason: Unknown
Diagnosing: Following errors when scanning specific regions
java.lang.InternalError: Could not decompress data. Input is invalid.
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at
Or
Resolution: Requires sideline of affecting files and re-ingestion of row keys
stored on this file. Potential data loss
© Cloudera, Inc. All rights reserved.
Case story - HBase is eating HDFS space
Type: FileSystem usage
Feature: Replication | Snapshot | Compaction | Cleaners
Reason: Various
• Replication: Slow, faulty or disabled peer, missing tables on remote peer
• Snapshot: Too many snapshots being retained
• Compaction and Cleaners threads stuck or not running
Diagnosing:
• Check usage for "archive" and "oldWALS"
• Master logs would show if cleaner threads are running
• Is replication stuck or lagging?
• How about snapshot retention policy?
Resolution:
• If cleaner threads are not running, restart master
• For disabled peers, only enabling it again, or remove it, if no replication is wanted
• Too many snapshots would require some cleaning or cold backup
• Replication lags reason may vary
© Cloudera, Inc. All rights reserved.
General best practices
• Heap usage monitoring
• Keep regions/RS on low hundreds
• Consider G1 GC for heaps > 20GB
• Data locality
• Adjust caching according to workload
• Compaction Policy (Consider offline compactions using CompactionTool)
• Consider an exclusive Zookeeper for HBase
© Cloudera, Inc. All rights reserved.
General best practices
• Adjust Master initialization timeout accordingly
• Consider increase number of "open region" handlers
• Define reasonable snapshot retention policy
• Caution with experimental/non stable features (snappy/prefixtree)
• Define deployment policy for custom applications/CPs/Filters
• Define patch/bug fix upgrades schedule
© Cloudera, Inc. All rights reserved.
Q&A

More Related Content

What's hot

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practicelarsgeorge
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomynzhang
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive DataWorks Summit
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightDatabricks
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance TuningLars Hofhansl
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Operating Systems - A Primer
Operating Systems - A PrimerOperating Systems - A Primer
Operating Systems - A PrimerSaumil Shah
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLDatabricks
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in HiveDataWorks Summit
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Databricks
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...DataWorks Summit/Hadoop Summit
 
Dive into ROP - a quick introduction to Return Oriented Programming
Dive into ROP - a quick introduction to Return Oriented ProgrammingDive into ROP - a quick introduction to Return Oriented Programming
Dive into ROP - a quick introduction to Return Oriented ProgrammingSaumil Shah
 

What's hot (20)

Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Nifi workshop
Nifi workshopNifi workshop
Nifi workshop
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
HBase in Practice
HBase in PracticeHBase in Practice
HBase in Practice
 
Hive Anatomy
Hive AnatomyHive Anatomy
Hive Anatomy
 
What's new in apache hive
What's new in apache hive What's new in apache hive
What's new in apache hive
 
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow FlightThe Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
The Data Lake Engine Data Microservices in Spark using Apache Arrow Flight
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Operating Systems - A Primer
Operating Systems - A PrimerOperating Systems - A Primer
Operating Systems - A Primer
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Building a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQLBuilding a SIMD Supported Vectorized Native Engine for Spark SQL
Building a SIMD Supported Vectorized Native Engine for Spark SQL
 
Cost-based Query Optimization in Hive
Cost-based Query Optimization in HiveCost-based Query Optimization in Hive
Cost-based Query Optimization in Hive
 
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
Running Apache Spark on a High-Performance Cluster Using RDMA and NVMe Flash ...
 
Node Labels in YARN
Node Labels in YARNNode Labels in YARN
Node Labels in YARN
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...How to understand and analyze Apache Hive query execution plan for performanc...
How to understand and analyze Apache Hive query execution plan for performanc...
 
Dive into ROP - a quick introduction to Return Oriented Programming
Dive into ROP - a quick introduction to Return Oriented ProgrammingDive into ROP - a quick introduction to Return Oriented Programming
Dive into ROP - a quick introduction to Return Oriented Programming
 

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trencheswchevreuil
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxManish Maheshwari
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaManish Maheshwari
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaJason Shih
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaCloudera, Inc.
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Tingjaxconf
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingGreat Wide Open
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best PracticesVenu Anuganti
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera FieldHBaseCon
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBaseCon
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014Nick Dimiduk
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Perforce
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7Ted Dunning
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoopjdcryans
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Chris Nauroth
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldDataWorks Summit
 

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field. (20)

HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 

More from DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...DataWorks Summit
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsDataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 

Recently uploaded

UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewDianaGray10
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxFIDO Alliance
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....rightmanforbloodline
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireExakis Nelite
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024Lorenzo Miniero
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!Memoori
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)Wonjun Hwang
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...ScyllaDB
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data SciencePaolo Missier
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentationyogeshlabana357357
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctBrainSell Technologies
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهMohamed Sweelam
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch TuesdayIvanti
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfalexjohnson7307
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingScyllaDB
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxMarkSteadman7
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 

Recently uploaded (20)

UiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overviewUiPath manufacturing technology benefits and AI overview
UiPath manufacturing technology benefits and AI overview
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024WebRTC and SIP not just audio and video @ OpenSIPS 2024
WebRTC and SIP not just audio and video @ OpenSIPS 2024
 
State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!State of the Smart Building Startup Landscape 2024!
State of the Smart Building Startup Landscape 2024!
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)
 
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...
 
Design and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data ScienceDesign and Development of a Provenance Capture Platform for Data Science
Design and Development of a Provenance Capture Platform for Data Science
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
الأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهلهالأمن السيبراني - ما لا يسع للمستخدم جهله
الأمن السيبراني - ما لا يسع للمستخدم جهله
 
2024 May Patch Tuesday
2024 May Patch Tuesday2024 May Patch Tuesday
2024 May Patch Tuesday
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Generative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdfGenerative AI Use Cases and Applications.pdf
Generative AI Use Cases and Applications.pdf
 
Event-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream ProcessingEvent-Driven Architecture Masterclass: Challenges in Stream Processing
Event-Driven Architecture Masterclass: Challenges in Stream Processing
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 

HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

  • 1. © Cloudera, Inc. All rights reserved. HBase Tales From the Trenches Wellington Chevreuil
  • 2. © Cloudera, Inc. All rights reserved. Agenda • Common types of problems • Most affected features • Common reasons • Case stories review • General best practices
  • 3. © Cloudera, Inc. All rights reserved. Common types of problems • RegionServers/Master crashes • Performance • Regions stuck in transition (RIT) • FileSystem usage • Corruption / Data loss / Replication data consistency
  • 4. © Cloudera, Inc. All rights reserved. Most affected features or sub-systems • AssignmentManager • Replication • Snapshot • Region Server Memory Management • Master Initialisation • WAL/StoreFile • Custom applications / filters / co-processors
  • 5. © Cloudera, Inc. All rights reserved. Common reasons • Performance • Overload/under-dimensioned cluster • Non optimal configurations • Crashes • Memory exhaustion • File system issues • Known bugs • RIT • Bugs • Self induced (wrong hbck commands triggered)
  • 6. © Cloudera, Inc. All rights reserved. Common reasons • RIT (cont.) • Can also happen as side effect of performance, crash or corruption issues • File system usage exhaustion • Replication related issues • Too many snapshots • Corruption / Data loss / Replication data consistency • Bugs • Faulty peers / custom or third party components • FileSystem problems
  • 7. © Cloudera, Inc. All rights reserved. Case Stories Review
  • 8. © Cloudera, Inc. All rights reserved. Case story - RegionServers slow/crashing randomly Type: Process crash | Service outage | Performance Feature: Region Server Memory Management Reason: Long GC pauses, due to mismatching heap sizes and workloads Diagnosing: • Frequent JvmPauseMonitor alerts on RSes logs • Occasionally OOME on stdout • Too many regions per RS (more than 200 regions) • JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole) Resolution: • Initially increase the heap size, but CMS may experience slowness with large heaps • For heaps larger than 20GB, general G1 recommendations from Cloudera engineering blog post had provided good results • Horizontal scaling by adding more RSes
  • 9. © Cloudera, Inc. All rights reserved. Case story - Slow scans, compactions delayed Type: Performance Feature: Store File Reason: PrefixTree HFile encoding issues (HBASE-17375) Diagnosing: • Compaction queue piling up • jstacks from RSes show below trace over several frames: "regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0 tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214) at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628) Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
  • 10. © Cloudera, Inc. All rights reserved. Case story - My HBase is slow Type: Performance Feature: Client Application Read/Write operations Reason: • Dependency services underperforming • Poor client implementation not reusing connections • Faulty CPs or custom filters Diagnosing: • Client application/RSes jstacks • General HBase stats such as: compaction queue size, data locality, cache hit ratio • HDFS/ZK logs Resolution: Usually require tunings on dependency services or redesign of client application/custom CPs/Filters
  • 11. © Cloudera, Inc. All rights reserved. Case story - Client scans failing | HBCK reports inconsistencies Type: RIT Feature: AssignmentManager Reason: Various • Misuse of hbck • Snapshots cold backups out of hdfs • Busy/overloaded clusters where regions keep moving constantly Diagnosing: • Evident from hbck reports/Master Web UI • Master logs would show RS opening/hosting region • RS holding region should have relevant error message logs Resolution: • There's no single recipe • Each case may require a combination of hbck/hbck2 commands
  • 12. © Cloudera, Inc. All rights reserved. Case story - Master timing out during initialisation Type: Master Crash | Service Outage Feature: Master initialisation Reason: Different bugs can cause procedures to pile up: • HBASE-22263, HBASE-16488, HBASE-18109 Diagnosing: • Listing "/hbase/MasterProcWALs" shows hundreds or more files • Master times out and crashes before assigning namespace region ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after Resolution: • Stop Master and clean "/hbase/MasterProcWALs" folder • Caution when on hbase > 2.x releases Mitigation: • Increase init timeout, number of open region threads
  • 13. © Cloudera, Inc. All rights reserved. Case story - Replication lags Type: Replication stuck Feature: Replication Reason: Single WAL entries with too many OPs, leading to RPCs larger than "hbase.ipc.server.max.callqueue.size" Diagnosing: Destination peer RSes showing type of log messages below 2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops, last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure 2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because: Resolution: requires wal copy and replay from source to destination, plus manual znode cleanup Mitigation: releases including HBASE-18027 would prevent this situation
  • 14. © Cloudera, Inc. All rights reserved. Case story - Client scan failing on specific regions Type: HFile Corruption Feature: Store File Reason: Unknown Diagnosing: Following errors when scanning specific regions java.lang.InternalError: Could not decompress data. Input is invalid. at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method) at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239) org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at Or Resolution: Requires sideline of affecting files and re-ingestion of row keys stored on this file. Potential data loss
  • 15. © Cloudera, Inc. All rights reserved. Case story - HBase is eating HDFS space Type: FileSystem usage Feature: Replication | Snapshot | Compaction | Cleaners Reason: Various • Replication: Slow, faulty or disabled peer, missing tables on remote peer • Snapshot: Too many snapshots being retained • Compaction and Cleaners threads stuck or not running Diagnosing: • Check usage for "archive" and "oldWALS" • Master logs would show if cleaner threads are running • Is replication stuck or lagging? • How about snapshot retention policy? Resolution: • If cleaner threads are not running, restart master • For disabled peers, only enabling it again, or remove it, if no replication is wanted • Too many snapshots would require some cleaning or cold backup • Replication lags reason may vary
  • 16. © Cloudera, Inc. All rights reserved. General best practices • Heap usage monitoring • Keep regions/RS on low hundreds • Consider G1 GC for heaps > 20GB • Data locality • Adjust caching according to workload • Compaction Policy (Consider offline compactions using CompactionTool) • Consider an exclusive Zookeeper for HBase
  • 17. © Cloudera, Inc. All rights reserved. General best practices • Adjust Master initialization timeout accordingly • Consider increase number of "open region" handlers • Define reasonable snapshot retention policy • Caution with experimental/non stable features (snappy/prefixtree) • Define deployment policy for custom applications/CPs/Filters • Define patch/bug fix upgrades schedule
  • 18. © Cloudera, Inc. All rights reserved. Q&A