Supporting HBase: How to Stabilize, Diagnose and Repair (Jeff Bean, Jonathan Hsieh, Kathleen Ting, Cloudera)
 

Speaker Notes

  • We get a lot of complaints about HBase. Here are some classic objections from real conversations: "I thought this was a scalable system!" "A process should never crash..." "Randomly!" "When it's running on a node that swaps!" "(but only sometimes)" "When it can't connect to another node!" "When it's sharing a node with misbehaving processes!" "When it's colocated with MapReduce!" EVER! "I thought Hadoop was meant to tolerate failures!" "HBASE CRASHES TOO MUCH!" "HBASE IS UNSTABLE!" I'm a big fan of Bloom County, and since Cloudera is a startup that sells its services and subscriptions to more mature companies, I often feel like Binkley trying to get his father to stop smoking. When people stand up HBase, they often take the default configurations or do what feels good, like allocating lots of MapReduce slots per server running HBase, which makes HBase sick. It's important to sit down and discuss the recommendations.
  • Because we support a lot of HBase clusters that get really unhealthy, we have the luxury of having taken some full-color photos of unhealthy HBase clusters. Common complaints, from real support tickets: "Why did HBase fall down and go boom?" "En masse failure of HBase Region Servers" "Region Server Massive Failure" "Problems with Zookeeper and/or regionserver" "Regionservers becoming unavailable" "HBase region servers crashing" "HBase fail-over/reliability issues" "Yesterday HBase daemons started crashing" "hbase RS keep crashing" "HBase regionservers shut down running large mapreduce jobs" "HBase region servers shut down one by one" Etc.
  • HBase has two requirements: strong consistency and low-latency access. HBase is a distributed, scalable, fault-tolerant system with dependencies on two other distributed, scalable, fault-tolerant systems: HDFS and ZK. It's the complexity of one distributed system depending on two other distributed systems, in light of strong consistency and low-latency access, that leads to the perceived instability of HBase. When a connection to or integration with either HDFS or ZK becomes unreliable, HBase does not trust that it can meet its requirements, and it crashes rather than serve old data or serve with bad latency. You also have a stack of dependencies: the JVM and OS depend on the network and disk; HDFS and ZK depend on the JVM and the OS; HBase depends on ZK and HDFS; MR and the HBase application require HBase to be available. And since this whole thing is a distributed system, the dependencies have to be met between and across nodes as well.
  • Is this the right position?
  • First symptom: MR. Failed tasks, blacklisted TaskTrackers, long execution time, and ultimate failure of the job. Due to HBase: dead region servers! Due to HDFS: lease expiry. Due to ZooKeeper: connection timeouts.
  • When the preventative measures mentioned by Jeff are not taken, your cluster might end up in the ER with a critical production failure.
  • This pie chart is the product of analyzing critical production HBase tickets over the past 6 months: misconfiguration 44%, patch required 12%, HW/NW 16%, repair 28%. In 44% of cases, correcting a misconfiguration was all it took to bring HBase back up again. As you can see, misconfigurations and bugs break the most HBase clusters. Fixing bugs is up to the community; fixing misconfigurations is up to you, and that's the focus of the next segment. Because they're hard to diagnose, misconfigurations are not what you want to spend your time on. If your cluster is broken, it's probably a misconfiguration. This is a hard problem because the error messages are not tightly tied to the root cause.
  • Unlike a patient, a cluster can't tell us what's wrong. Therefore, when support tickets come in, the first things we generally ask for are logs and configurations. These are what we look at for evidence and symptoms. Given that HBase depends on both ZK and HDFS, a ZK or HDFS misconfiguration can cause erratic behavior in HBase, so we need to look at those logs too.
  • Just as an ER triage doctor will treat patients based on their symptoms’ severity, we do the same with clusters.
  • To manually close a leaked connection: HConnectionManager.deleteConnection(config, true);
  • A datanode has an upper bound on the number of files that it will serve at any one time, called xcievers. By increasing dfs.datanode.max.xcievers, you raise the limit on transfer threads allowed at a DN. HBase, in particular, needs the extra transfer threads because its use of HDFS is very different from MR's: in HBase, data files are opened on cluster startup and kept open so that we avoid paying the file-open costs on each access. This can lead to HDFS running out of file descriptors and datanode threads. Although there will be exceptions in the logs, often you'll first see erratic behavior in HBase.
  • This is a deceiving error message since, in some cases, it's not caused by a long GC pause. In the RS log: INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, sessionTimeout=180000. But a different value appears in the ZK log: INFO org.apache.zookeeper.server.ZooKeeperServer: Expiring session, timeout of 40000ms exceeded. The timeout needs to be set to the same value on the ZK servers as what is used by the HBase clients. Then restart ZK and HBase.
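The mismatch above falls out of how ZooKeeper negotiates session timeouts: the server clamps the client's requested timeout into [2 x tickTime, maxSessionTimeout], and the default maximum is 20 x tickTime, i.e. 40000 ms with the default tickTime of 2000 ms. A minimal sketch of that clamping (the function name and numbers are illustrative, not a ZooKeeper API):

```python
def negotiated_session_timeout(client_request_ms, tick_time_ms=2000,
                               max_session_timeout_ms=None):
    """ZooKeeper servers clamp a client's requested session timeout into
    [2*tickTime, maxSessionTimeout]; the default max is 20*tickTime.
    This is why an HBase client asking for 180000 ms can still be
    expired after 40000 ms if maxSessionTimeout was never raised."""
    min_t = 2 * tick_time_ms
    max_t = max_session_timeout_ms or 20 * tick_time_ms
    return max(min_t, min(client_request_ms, max_t))

# With ZK defaults, the 180000 ms request is silently cut to 40000 ms:
print(negotiated_session_timeout(180000))                                 # 40000
# After setting maxSessionTimeout=180000 in zoo.cfg, the request sticks:
print(negotiated_session_timeout(180000, max_session_timeout_ms=180000))  # 180000
```

This is why the fix is to raise maxSessionTimeout on the ZK servers to match zookeeper.session.timeout on the HBase side, not just to change the client.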
  • @tlipcon: "if sum(max heap size) > physical RAM - 3GB, go directly to jail. do not pass go. do not collect $200." The TT is counted twice because it forks. The JT is not accounted for because it runs on a separate node. Hadoop streaming jobs don't pay attention to the heap size specified in mapred.child.java.opts because they spawn processes; that's where ulimit takes over. The OS needs 20% of RAM because of the optimizations the filesystem cache provides for HBase, and MR jobs have a way of completely clearing out the FS cache, which can hurt HBase performance.
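The slot-and-heap arithmetic above can be sketched as a quick per-node sanity check. The function and the example numbers are illustrative assumptions, not tuning advice:

```python
def heap_budget_ok(total_ram_gb, map_slots, reduce_slots, child_heap_gb,
                   dn_heap_gb, tt_heap_gb, rs_heap_gb, os_reserve_frac=0.20):
    """Per-node heap rule of thumb: the sum of all JVM heaps plus an
    OS/filesystem-cache reserve (~20% of RAM) must fit in physical RAM.
    The forked TaskTracker children are covered by the slot terms."""
    committed = ((map_slots + reduce_slots) * child_heap_gb
                 + dn_heap_gb + tt_heap_gb + rs_heap_gb)
    reserve = os_reserve_frac * total_ram_gb
    return committed + reserve <= total_ram_gb

# Hypothetical node: 8 map + 4 reduce slots at 1 GB each,
# 1 GB DataNode, 1 GB TaskTracker, 12 GB RegionServer.
print(heap_budget_ok(48, 8, 4, 1, 1, 1, 12))  # 26 + 9.6 = 35.6 <= 48 -> True
print(heap_budget_ok(24, 8, 4, 1, 1, 1, 12))  # 26 + 4.8 = 30.8 > 24  -> False
```

The second call shows the failure mode from the case study: the same slot layout that is fine on a 48 GB node oversubscribes a 24 GB node and pushes it into swap.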
  • It is not good to raise dfs.replication.min from its default of 1, because the moment you run into a DFS replication issue or a single failure in the pipeline, your file.close() is going to hang to guarantee minimum replication. As a result, this form of hard guarantee can bring down HBase clusters during high xceiver load on the DNs, or when disks fill up on many of them.
  • Given the many cluster resources leveraged by distributed ZooKeeper, it's frequently the first to notice issues affecting cluster health, which explains its moniker: the canary in the Hadoop coal mine.
  • The more members an ensemble has, the more tolerant the ensemble is of host failures. ZK achieves high availability through replication and can provide service as long as a majority of the machines in the ensemble are up; this is why there is usually an odd number of machines in an ensemble.
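The majority rule can be made concrete in a few lines (a sketch; `zk_failure_tolerance` is a hypothetical helper, not a ZooKeeper API):

```python
def zk_failure_tolerance(ensemble_size):
    """A ZooKeeper ensemble serves requests only while a strict majority
    of its members are up, so it tolerates losing
    ensemble_size - (ensemble_size // 2 + 1) hosts."""
    majority = ensemble_size // 2 + 1
    return ensemble_size - majority

for n in (3, 4, 5):
    print(n, zk_failure_tolerance(n))
# 3 -> 1, 4 -> 1, 5 -> 2: a fourth server buys no extra tolerance,
# which is why ensembles are usually odd-sized.
```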
  • Because they're hard to diagnose, misconfigurations are not what you want to spend your time on. If your cluster is broken, it's probably a misconfiguration. This is a hard problem because, as you saw, the error messages are not tightly tied to the root cause. Keep these six common misconfigurations in mind and it'll be less painful when you configure your HBase cluster. But if the pain persists, you know who to call.

Presentation Transcript

  • 1. Supporting HBase: How to Stabilize, Diagnose and Repair. Jeff Bean, Jonathan Hsieh, Kathleen Ting {jwfbean,jon,kathleen}@cloudera.com. HBaseCon 2012, 5/22/12. Copyright 2012 Cloudera Inc. All rights reserved.
  • 2. Who Are We? Jeff Bean: Designated Support Engineer, Cloudera; Education Program Lead, Cloudera. Kathleen Ting: Support Manager, Cloudera; ZooKeeper Subject Matter Expert. Jonathan Hsieh: Software Engineer, Cloudera; Apache HBase Committer and PMC member.
  • 3. Outline: Preventative HBase Medicine (tips for a healthy HBase); The HBase Triage (fixes for acute HBase pains); The HBase Surgery (repairing a corrupted HBase).
  • 4. Outline (repeated): Preventative HBase Medicine; The HBase Triage; The HBase Surgery.
  • 5. "Monitor your system, exercise your workload, and eat your vegetables."
  • 6. (photo slide)
  • 7. (photo slide)
  • 8. HBase Cross-Section: App, MR, HBase, ZooKeeper, HDFS, JVM / Linux, Disk / Network.
  • 9. Doctor's Advice: "An ounce of prevention is worth a pound of cure." Understand your workload and test for it. Size your cluster properly (see Cluster Sizer). Monitor, alert, and manage your cluster with Ganglia, Nagios, and/or Cloudera Manager. Don't be Dr. House!
  • 10. A Case Study
  • 11. Symptom: long-running MapReduce job with blacklisted TaskTrackers. TaskTracker failure counts: NodeX 4, NodeY 3, NodeQ 7, NodeB 10, NodeP 8, NodeV 6.
  • 12. Symptom: Node B task logs. $ find . | xargs grep "giving up" shows repeated lines like: ./attempt_201107261334_0221_m_000962_1/syslog: 2011-08-02 11:09:34,248 INFO org.apache.hadoop.ipc.HbaseRPC: Server at NodeA:60020 could not be reached after 1 tries, giving up. (repeated at 11:09:37 and 11:09:40)
  • 13. Symptom: RegionServer log on Node A: 2011-08-02 11:04:20,324 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server serverName=NodeA,60020,1312228900706, load=(requests=10847, regions=342, usedHeap=8193, maxHeap=15350): regionserver:60020-0x4316487a73e1626 received expired from ZooKeeper, aborting. org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
  • 14. Cascading failure! Some other node says ouch: 2011-08-01 12:55:39,356 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=Thread[Thread-15,5,main]. 12:55:39,629 INFO HRegionServer: STOPPED: Shutdown hook. 12:55:39,695 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/NodeA,60020,1311651881177/NodeA%3A60020.1311656326143: java.io.IOException: Error Recovery for block blk_1102151039331207284_16350929 failed because recovery from primary datanode NodeA:50010 failed 6 times. Pipeline was NodeA:50010. Aborting... (stack trace through DFSClient$DFSOutputStream.processDatanodeError). 12:55:39,842 INFO ShutdownHook: Shutdown hook finished.
  • 15. Symptom: Ganglia memory graph on Node A (graph slide).
  • 16. Symptom: Ganglia swap_free on Node A (graph slide).
  • 17. A Case Study: Radiant Pain. "I was having back pains, and it turned out to be my heart!" Causes: too many MR slots; MR slots too large; too many non-HBase small files (HDFS-2379). Node A comes under load and swaps, so "arbitrary" processes pause or become unresponsive. Node B then can't connect to Node A: MapReduce tasks fail, HDFS datanode operations time out, and HBase client operations fail. The masters take action: the JobTracker blacklists the TaskTracker on Node B, jobs fail or run slow, and the NameNode re-replicates blocks from Node A.
  • 18. Event Trail and Evidence Trail. Node A condition (load) → Node A event (swap) → Node B symptom (connect) → Master action (blacklist). Evidence: Node A logs and monitoring UIs (the transient swap is not logged!), Node B logs, master logs.
  • 19. DOs and DON'Ts for keeping HBase healthy. DOs: monitor and alert; optimize the network; know your logs. DON'Ts: swap; oversubscribe MR; share the network.
  • 20. Outline (repeated): Preventative HBase Medicine; The HBase Triage; The HBase Surgery.
  • 21. “Cloudera 911 here, how can we help?”
  • 22. HBase Support Tickets: Misconfig (HBase, ZK, MR, HDFS) 44%; Repair Needed 28%; Fix HW/NW 16%; Patch Required 12%.
  • 23. Understanding the logs helps us diagnose issues. Related events are logged by different processes in different places, and log messages point at each other: HDFS accesses by the RS are logged by the NN and DN; HBase accesses by MR are logged by the JT, RS, NN, and ZK; ZK logs indicate HBase health.
  • 24. The HBase Triage: Fixes for acute HBase pains. Severe pain; complete unconsciousness.
  • 25. The HBase Triage (repeated): Severe pain; complete unconsciousness.
  • 26. Connection Reset. WARN - Session <id> for server <server id>, unexpected error, closing socket connection and attempting reconnect. java.io.IOException: Connection reset by peer. What causes this? Running out of ZK connections. How can it be resolved? Manually close connections; fixed in HBASE-5466 and HBASE-4773.
  • 27. Running out of DN Threads & File Descriptors. INFO hdfs.DFSClient: Could not obtain block <blk id> from any node: java.io.IOException: No live nodes contain current block. ERROR java.io.IOException: Too many open files. What causes this? HBase likes to keep data files open. How can it be resolved? Increase dfs.datanode.max.xcievers to 4096, and raise the limit in /etc/security/limits.conf: hbase - nofile 32768.
  • 28. "Long Garbage Collecting Pause". WARN org.apache.hadoop.hbase.util.Sleeper: We slept 19118ms instead of 1000ms, this is likely due to a long garbage collecting pause and its usually bad. How can it be resolved? Set zoo.cfg: maxSessionTimeout=180000 and hbase-site.xml: zookeeper.session.timeout=180000. The node is oversubscribed if MR & HBase are co-located.
  • 29. Heap Allocation Per Node: (Map + Red) x Child Heap + DN heap + TT heap + RS heap + OS (20% of RAM) ≤ Total RAM.
  • 30. The HBase Triage (repeated): Severe pain; complete unconsciousness.
  • 31. ZK can't start & HBase hangs. INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file <name> retrying… What causes this? A high dfs.replication.min hangs HBase: a file can't be closed until all replicas are created. How can it be resolved? Remove dfs.replication.min; temporarily increase dfs.balance.bandwidthPerSec. Fixed in HDFS-2936.
  • 32. Unable to Load Database. FATAL org.apache.zookeeper.server.quorum.QuorumPeer: Unable to load database on disk. What causes this? ZK data directories filled up. How can it be resolved? Wipe out /var/zookeeper/version-2 and run the zkCleanup.sh script via cron.
  • 33. Downed HBase Master and RegionServers. WARN org.apache.zookeeper.server.quorum.Learner: Exception when following the leader. java.net.SocketTimeoutException: Read timed out. What causes this? Session timeout + session expiration = network problem. How can it be resolved? Monitor the network (e.g. ifconfig) and run ≥ 3 ZK servers (majority rules).
  • 34. The HBase Triage (repeated): Severe pain; complete unconsciousness.
  • 35. Outline (repeated): Preventative HBase Medicine; The HBase Triage; The HBase Surgery.
  • 36. "To the operating room, please." HBase refuses to start; HBase's HBCK reports inconsistencies.
  • 37. HBase Support Tickets (repeated): Misconfig 44%; Repair Needed 28%; Fix HW/NW 16%; Patch Required 12%.
  • 38. HBase Support Tickets (repeated): Misconfig 44%; Repair Needed 28%; Fix HW/NW 16%; Patch Required 12%.
  • 39. Detecting internal problems with hbck. HBase since 0.90 has included a tool for scanning an HBase instance's internals to find corruptions: hbase hbck, or hbase hbck -details.
  • 40. Tables are sharded into regions: [‘’, A), [A, B), [B, ‘’). Invariants: maintain table integrity and region consistency!
  • 41. Table Integrity Invariants. Every key shall get assigned to a single region, and table regions shall cover the entire range of possible keys: [‘’,A), [A,B), [B,C), [C,D), [D,E), [E,F), [F,G), [G,‘’), from the absolute start (‘’) to the absolute end (unfortunately, also ‘’).
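The two invariants can be sketched as a small checker over sorted [start, end) key ranges, with '' standing for both the absolute start key (of the first region) and the absolute end key (of the last). This is a hypothetical simplification of what hbck actually scans, not its implementation:

```python
def check_table_integrity(regions):
    """regions: list of (start_key, end_key) pairs; '' means the absolute
    start in the first region and the absolute end in the last region.
    Returns a list of invariant violations: uncovered keyspace ends,
    holes between regions, or overlapping regions."""
    problems = []
    regions = sorted(regions, key=lambda r: r[0])
    if not regions or regions[0][0] != '':
        problems.append('keyspace does not begin at the absolute start key')
    if not regions or regions[-1][1] != '':
        problems.append('keyspace does not end at the absolute end key')
    for (s1, e1), (s2, e2) in zip(regions, regions[1:]):
        if e1 < s2:
            problems.append(f'hole between [{s1},{e1}) and [{s2},{e2})')
        elif e1 > s2:
            problems.append(f'overlap between [{s1},{e1}) and [{s2},{e2})')
    return problems

print(check_table_integrity([('', 'A'), ('A', 'B'), ('B', '')]))  # []
print(check_table_integrity([('', 'A'), ('B', '')]))              # one hole reported
print(check_table_integrity([('', 'B'), ('A', '')]))              # one overlap reported
```

The three calls correspond to the slides that follow: a healthy table, a table with a hole (repaired with -fixHdfsHoles), and a table with an overlap (repaired with -fixHdfsOverlaps).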
  • 42. Region Consistency Invariants. A region is judged by three pieces of state: its info:regioninfo entry in META, its assignment on a region server, and its .regioninfo file in HDFS. When all three agree, the region is consistent; mismatched or missing state produces orphans.
  • 43. Repairing internal problems with hbck. Newer and upcoming versions of HBase include an hbck that can fix internal problems as well as detect them: 0.90.7, 0.92.2, 0.94.0, CDH3u4+, CDH4b2+. "Looks like you've broken an invariant."
  • 44. Bad region assignment: hbck -fix (0.90.x); hbck -fixAssignments (0.90.7+, 0.92.2+, 0.94+).
  • 45. Region not in META: hbck -fixAssignments -fixMeta.
  • 46. Regioninfo not in HDFS: hbck -fixAssignments -fixMeta.
  • 47. Table regions must not have holes. Where do I put row key "CRUD"? Where is region [C,D)? Repair: find the orphan and adopt it, or fabricate a new region to fill the hole. NOTE! HBase should be idle (no get/put/split/compacts). hbck -fixHdfsHoles -fixHdfsOrphans -fixAssignments -fixMeta
  • 48. Table regions must not overlap. Hm... which region should "BAD" go to? Is it [B, D) or is it [B, C)? Likely due to a bad split. Repair: merge regions, or sideline and bulk load. NOTE! HBase should be idle (no get/put/split/compacts). hbck -fixHdfsOverlaps -fixAssignments -fixMeta
  • 49. Consistency problem summary: hbck -fixAssignments -fixMeta -fixHdfsHoles -fixHdfsOrphans -fixHdfsOverlaps
  • 50. Investigating further. HFile: examine contents of HFiles. HLog: examine contents of an HLog file. OfflineMetaRepair: rebuild the META table from the file system. Also, some scripts for manual repairs: https://github.com/jmhsieh/hbase-repair-scripts
  • 51. Outline (repeated): Preventative HBase Medicine; The HBase Triage; The HBase Surgery.
  • 52. Questions?