SlideShare a Scribd company logo
1 of 18
Download to read offline
© Cloudera, Inc. All rights reserved.
HBase Tales From the Trenches
Wellington Chevreuil
© Cloudera, Inc. All rights reserved.
Agenda
• Common types of problems
• Most affected features
• Common reasons
• Case stories review
• General best practices
© Cloudera, Inc. All rights reserved.
Common types of problems
• RegionServers/Master crashes
• Performance
• Regions stuck in transition (RIT)
• FileSystem usage
• Corruption / Data loss / Replication data consistency
© Cloudera, Inc. All rights reserved.
Most affected features or sub-systems
• AssignmentManager
• Replication
• Snapshot
• Region Server Memory Management
• Master Initialisation
• WAL/StoreFile
• Custom applications / filters / co-processors
© Cloudera, Inc. All rights reserved.
Common reasons
• Performance
• Overload/under-dimensioned cluster
• Non optimal configurations
• Crashes
• Memory exhaustion
• File system issues
• Known bugs
• RIT
• Bugs
• Self induced (wrong hbck commands triggered)
© Cloudera, Inc. All rights reserved.
Common reasons
• RIT (cont.)
• Can also happen as side effect of performance, crash or corruption issues
• File system usage exhaustion
• Replication related issues
• Too many snapshots
• Corruption / Data loss / Replication data consistency
• Bugs
• Faulty peers / custom or third party components
• FileSystem problems
© Cloudera, Inc. All rights reserved.
Case Stories Review
© Cloudera, Inc. All rights reserved.
Case story - RegionServers slow/crashing randomly
Type: Process crash | Service outage | Performance
Feature: Region Server Memory Management
Reason: Long GC pauses, due to mismatching heap sizes and workloads
Diagnosing:
• Frequent JvmPauseMonitor alerts on RSes logs
• Occasionally OOME on stdout
• Too many regions per RS (more than 200 regions)
• JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole)
Resolution:
• Initially increase the heap size, but CMS may experience slowness with large heaps
• For heaps larger than 20GB, general G1 recommendations from Cloudera engineering
blog post had provided good results
• Horizontal scaling by adding more RSes
© Cloudera, Inc. All rights reserved.
Case story - Slow scans, compactions delayed
Type: Performance
Feature: Store File
Reason: PrefixTree HFile encoding issues (HBASE-17375)
Diagnosing:
• Compaction queue piling up
• jstacks from RSes show below trace over several frames:
"regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0
tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000]
java.lang.Thread.State: RUNNABLE
at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214)
at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181)
at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108)
at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628)
Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding
Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
© Cloudera, Inc. All rights reserved.
Case story - My HBase is slow
Type: Performance
Feature: Client Application Read/Write operations
Reason:
• Dependency services underperforming
• Poor client implementation not reusing connections
• Faulty CPs or custom filters
Diagnosing:
• Client application/RSes jstacks
• General HBase stats such as: compaction queue size, data locality, cache hit ratio
• HDFS/ZK logs
Resolution: Usually require tunings on dependency services or redesign of
client application/custom CPs/Filters
© Cloudera, Inc. All rights reserved.
Case story - Client scans failing | HBCK reports
inconsistencies
Type: RIT
Feature: AssignmentManager
Reason: Various
• Misuse of hbck
• Snapshots cold backups out of hdfs
• Busy/overloaded clusters where regions keep moving constantly
Diagnosing:
• Evident from hbck reports/Master Web UI
• Master logs would show RS opening/hosting region
• RS holding region should have relevant error message logs
Resolution:
• There's no single recipe
• Each case may require a combination of hbck/hbck2 commands
© Cloudera, Inc. All rights reserved.
Case story - Master timing out during initialisation
Type: Master Crash | Service Outage
Feature: Master initialisation
Reason: Different bugs can cause procedures to pile up:
• HBASE-22263, HBASE-16488, HBASE-18109
Diagnosing:
• Listing "/hbase/MasterProcWALs" shows hundreds or more files
• Master times out and crashes before assigning namespace region
ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after
Resolution:
• Stop Master and clean "/hbase/MasterProcWALs" folder
• Caution when on hbase > 2.x releases
Mitigation:
• Increase init timeout, number of open region threads
© Cloudera, Inc. All rights reserved.
Case story - Replication lags
Type: Replication stuck
Feature: Replication
Reason: Single WAL entries with too many OPs, leading to RPCs larger than
"hbase.ipc.server.max.callqueue.size"
Diagnosing: Destination peer RSes showing type of log messages below
2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops,
last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call
queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on
region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure
2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because:
Resolution: requires wal copy and replay from source to destination, plus
manual znode cleanup
Mitigation: releases including HBASE-18027 would prevent this situation
© Cloudera, Inc. All rights reserved.
Case story - Client scan failing on specific regions
Type: HFile Corruption
Feature: Store File
Reason: Unknown
Diagnosing: Following errors when scanning specific regions
java.lang.InternalError: Could not decompress data. Input is invalid.
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method)
at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239)
org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at
Or
Resolution: Requires sideline of affecting files and re-ingestion of row keys
stored on this file. Potential data loss
© Cloudera, Inc. All rights reserved.
Case story - HBase is eating HDFS space
Type: FileSystem usage
Feature: Replication | Snapshot | Compaction | Cleaners
Reason: Various
• Replication: Slow, faulty or disabled peer, missing tables on remote peer
• Snapshot: Too many snapshots being retained
• Compaction and Cleaners threads stuck or not running
Diagnosing:
• Check usage for "archive" and "oldWALS"
• Master logs would show if cleaner threads are running
• Is replication stuck or lagging?
• How about snapshot retention policy?
Resolution:
• If cleaner threads are not running, restart master
• For disabled peers, only enabling it again, or remove it, if no replication is wanted
• Too many snapshots would require some cleaning or cold backup
• Replication lags reason may vary
© Cloudera, Inc. All rights reserved.
General best practices
• Heap usage monitoring
• Keep regions/RS on low hundreds
• Consider G1 GC for heaps > 20GB
• Data locality
• Adjust caching according to workload
• Compaction Policy (Consider offline compactions using CompactionTool)
• Consider an exclusive Zookeeper for HBase
© Cloudera, Inc. All rights reserved.
General best practices
• Adjust Master initialization timeout accordingly
• Consider increase number of "open region" handlers
• Define reasonable snapshot retention policy
• Caution with experimental/non stable features (snappy/prefixtree)
• Define deployment policy for custom applications/CPs/Filters
• Define patch/bug fix upgrades schedule
© Cloudera, Inc. All rights reserved.
Q&A

More Related Content

What's hot

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
DataWorks Summit
 
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
DataWorks Summit
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
enissoz
 

What's hot (20)

Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
Monitoring Hadoop with Prometheus (Hadoop User Group Ireland, December 2015)
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
 
Simplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache KuduSimplifying Real-Time Architectures for IoT with Apache Kudu
Simplifying Real-Time Architectures for IoT with Apache Kudu
 
Presto: SQL-on-anything
Presto: SQL-on-anythingPresto: SQL-on-anything
Presto: SQL-on-anything
 
Tuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBaseTuning Apache Phoenix/HBase
Tuning Apache Phoenix/HBase
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ...
 
Apache Pulsar Development 101 with Python
Apache Pulsar Development 101 with PythonApache Pulsar Development 101 with Python
Apache Pulsar Development 101 with Python
 
Presto
PrestoPresto
Presto
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
 
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...HTTP Analytics for 6M requests per second using ClickHouse, by  Alexander Boc...
HTTP Analytics for 6M requests per second using ClickHouse, by Alexander Boc...
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos AlgorithmSolving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
Solving Hadoop Replication Challenges with an Active-Active Paxos Algorithm
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 
Financial Event Sourcing at Enterprise Scale
Financial Event Sourcing at Enterprise ScaleFinancial Event Sourcing at Enterprise Scale
Financial Event Sourcing at Enterprise Scale
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

Similar to HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field. (20)

HBase tales from the trenches
HBase tales from the trenchesHBase tales from the trenches
HBase tales from the trenches
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Strata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptxStrata London 2019 Scaling Impala.pptx
Strata London 2019 Scaling Impala.pptx
 
Strata London 2019 Scaling Impala
Strata London 2019 Scaling ImpalaStrata London 2019 Scaling Impala
Strata London 2019 Scaling Impala
 
Real-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using ImpalaReal-time Big Data Analytics Engine using Impala
Real-time Big Data Analytics Engine using Impala
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
 
Building an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen TingBuilding an Impenetrable ZooKeeper - Kathleen Ting
Building an Impenetrable ZooKeeper - Kathleen Ting
 
Troubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed DebuggingTroubleshooting Hadoop: Distributed Debugging
Troubleshooting Hadoop: Distributed Debugging
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBase: Where Online Meets Low Latency
HBase: Where Online Meets Low LatencyHBase: Where Online Meets Low Latency
HBase: Where Online Meets Low Latency
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
Hadoop Operations
Hadoop OperationsHadoop Operations
Hadoop Operations
 
Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale Still All on One Server: Perforce at Scale
Still All on One Server: Perforce at Scale
 
Introduction to HBase
Introduction to HBaseIntroduction to HBase
Introduction to HBase
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Inside MapR's M7
Inside MapR's M7Inside MapR's M7
Inside MapR's M7
 
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in HadoopKudu: Resolving Transactional and Analytic Trade-offs in Hadoop
Kudu: Resolving Transactional and Analytic Trade-offs in Hadoop
 
Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5Hadoop operations-2015-hadoop-summit-san-jose-v5
Hadoop operations-2015-hadoop-summit-san-jose-v5
 
Hadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the FieldHadoop Operations - Best Practices from the Field
Hadoop Operations - Best Practices from the Field
 

More from DataWorks Summit

HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 
Applying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real ProblemsApplying Noisy Knowledge Graphs to Real Problems
Applying Noisy Knowledge Graphs to Real Problems
 

Recently uploaded

CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
giselly40
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Recently uploaded (20)

Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

HBase Tales From the Trenches - Short stories about most common HBase operation/deployment problems seen in the field.

  • 1. © Cloudera, Inc. All rights reserved. HBase Tales From the Trenches Wellington Chevreuil
  • 2. © Cloudera, Inc. All rights reserved. Agenda • Common types of problems • Most affected features • Common reasons • Case stories review • General best practices
  • 3. © Cloudera, Inc. All rights reserved. Common types of problems • RegionServers/Master crashes • Performance • Regions stuck in transition (RIT) • FileSystem usage • Corruption / Data loss / Replication data consistency
  • 4. © Cloudera, Inc. All rights reserved. Most affected features or sub-systems • AssignmentManager • Replication • Snapshot • Region Server Memory Management • Master Initialisation • WAL/StoreFile • Custom applications / filters / co-processors
  • 5. © Cloudera, Inc. All rights reserved. Common reasons • Performance • Overload/under-dimensioned cluster • Non optimal configurations • Crashes • Memory exhaustion • File system issues • Known bugs • RIT • Bugs • Self induced (wrong hbck commands triggered)
  • 6. © Cloudera, Inc. All rights reserved. Common reasons • RIT (cont.) • Can also happen as side effect of performance, crash or corruption issues • File system usage exhaustion • Replication related issues • Too many snapshots • Corruption / Data loss / Replication data consistency • Bugs • Faulty peers / custom or third party components • FileSystem problems
  • 7. © Cloudera, Inc. All rights reserved. Case Stories Review
  • 8. © Cloudera, Inc. All rights reserved. Case story - RegionServers slow/crashing randomly Type: Process crash | Service outage | Performance Feature: Region Server Memory Management Reason: Long GC pauses, due to mismatching heap sizes and workloads Diagnosing: • Frequent JvmPauseMonitor alerts on RSes logs • Occasionally OOME on stdout • Too many regions per RS (more than 200 regions) • JVM Heap usage charts show wide heap usage (JVisualVM/Jconsole) Resolution: • Initially increase the heap size, but CMS may experience slowness with large heaps • For heaps larger than 20GB, general G1 recommendations from Cloudera engineering blog post had provided good results • Horizontal scaling by adding more RSes
  • 9. © Cloudera, Inc. All rights reserved. Case story - Slow scans, compactions delayed Type: Performance Feature: Store File Reason: PrefixTree HFile encoding issues (HBASE-17375) Diagnosing: • Compaction queue piling up • jstacks from RSes show below trace over several frames: "regionserver/hadoop30-r5.phx.impactradius.net/10.16.20.138:60020-longCompactions-1550194360449" #111 prio=5 os_prio=0 tid=0x00007fcb429b3800 nid=0x243a8 runnable [0x00007fc3481c7000] java.lang.Thread.State: RUNNABLE at org.apache.hadoop.hbase.codec.prefixtree.decode.PrefixTreeArrayScanner.advance(PrefixTreeArrayScanner.java:214) at org.apache.hadoop.hbase.codec.prefixtree.PrefixTreeSeeker.next(PrefixTreeSeeker.java:127) at org.apache.hadoop.hbase.io.hfile.HFileReaderV2$EncodedScannerV2.next(HFileReaderV2.java:1278) at org.apache.hadoop.hbase.regionserver.StoreFileScanner.next(StoreFileScanner.java:181) at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:108) at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:628) Resolution: Disable table, manual compact it with CompactionTool, disable PrefixTree encoding Mitigation: Disable PrefixTree encoding (Not supported anymore from 2.0 onwards)
  • 10. © Cloudera, Inc. All rights reserved. Case story - My HBase is slow Type: Performance Feature: Client Application Read/Write operations Reason: • Dependency services underperforming • Poor client implementation not reusing connections • Faulty CPs or custom filters Diagnosing: • Client application/RSes jstacks • General HBase stats such as: compaction queue size, data locality, cache hit ratio • HDFS/ZK logs Resolution: Usually require tunings on dependency services or redesign of client application/custom CPs/Filters
  • 11. © Cloudera, Inc. All rights reserved. Case story - Client scans failing | HBCK reports inconsistencies Type: RIT Feature: AssignmentManager Reason: Various • Misuse of hbck • Snapshots cold backups out of hdfs • Busy/overloaded clusters where regions keep moving constantly Diagnosing: • Evident from hbck reports/Master Web UI • Master logs would show RS opening/hosting region • RS holding region should have relevant error message logs Resolution: • There's no single recipe • Each case may require a combination of hbck/hbck2 commands
  • 12. © Cloudera, Inc. All rights reserved. Case story - Master timing out during initialisation Type: Master Crash | Service Outage Feature: Master initialisation Reason: Different bugs can cause procedures to pile up: • HBASE-22263, HBASE-16488, HBASE-18109 Diagnosing: • Listing "/hbase/MasterProcWALs" shows hundreds or more files • Master times out and crashes before assigning namespace region ERROR org.apache.hadoop.hbase.master.HMaster: Master failed to complete initialization after Resolution: • Stop Master and clean "/hbase/MasterProcWALs" folder • Caution when on hbase > 2.x releases Mitigation: • Increase init timeout, number of open region threads
  • 13. © Cloudera, Inc. All rights reserved. Case story - Replication lags Type: Replication stuck Feature: Replication Reason: Single WAL entries with too many OPs, leading to RPCs larger than "hbase.ipc.server.max.callqueue.size" Diagnosing: Destination peer RSes showing type of log messages below 2018-09-07 10:40:59,506 WARN org.apache.hadoop.hbase.client.AsyncProcess: #690, table=MY_TABLE, attempt=4/4 failed=2ops, last exception: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.CallQueueTooBigException): Call queue is full on /0.0.0.0:60020, is hbase.ipc.server.max.callqueue.size too small? on region-server-1.example.com,,60020,1524334173359, tracking started Fri Sep 07 10:35:53 IST 2018; not retrying 2 - final failure 2018-09-07 10:40:59,506 ERROR org.apache.hadoop.hbase.replication.regionserver.ReplicationSink: Unable to accept edit because: Resolution: requires wal copy and replay from source to destination, plus manual znode cleanup Mitigation: releases including HBASE-18027 would prevent this situation
  • 14. © Cloudera, Inc. All rights reserved. Case story - Client scan failing on specific regions Type: HFile Corruption Feature: Store File Reason: Unknown Diagnosing: Following errors when scanning specific regions java.lang.InternalError: Could not decompress data. Input is invalid. at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native Method) at org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:239) org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:163) Caused by: java.lang.NegativeArraySizeException at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1718) at org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1542) at Or Resolution: Requires sideline of affecting files and re-ingestion of row keys stored on this file. Potential data loss
  • 15. © Cloudera, Inc. All rights reserved. Case story - HBase is eating HDFS space Type: FileSystem usage Feature: Replication | Snapshot | Compaction | Cleaners Reason: Various • Replication: Slow, faulty or disabled peer, missing tables on remote peer • Snapshot: Too many snapshots being retained • Compaction and Cleaners threads stuck or not running Diagnosing: • Check usage for "archive" and "oldWALS" • Master logs would show if cleaner threads are running • Is replication stuck or lagging? • How about snapshot retention policy? Resolution: • If cleaner threads are not running, restart master • For disabled peers, only enabling it again, or remove it, if no replication is wanted • Too many snapshots would require some cleaning or cold backup • Replication lags reason may vary
  • 16. © Cloudera, Inc. All rights reserved. General best practices • Heap usage monitoring • Keep regions/RS on low hundreds • Consider G1 GC for heaps > 20GB • Data locality • Adjust caching according to workload • Compaction Policy (Consider offline compactions using CompactionTool) • Consider an exclusive Zookeeper for HBase
  • 17. © Cloudera, Inc. All rights reserved. General best practices • Adjust Master initialization timeout accordingly • Consider increase number of "open region" handlers • Define reasonable snapshot retention policy • Caution with experimental/non stable features (snappy/prefixtree) • Define deployment policy for custom applications/CPs/Filters • Define patch/bug fix upgrades schedule
  • 18. © Cloudera, Inc. All rights reserved. Q&A