ApacheCon Europe 2012: Operating HBase - Things you need to know

Operating HBase –
Things You Need to Know
Christian Gügi

Outline
● HBase internals
● Overview of HBase utilities
● HBase split visualisation with Hannibal
● Challenges & lessons learned
● Resources to get started

About me
● Software Architect @ Sentric
● Founder and organizer of the Swiss Big Data User Group
http://www.bigdata-usergroup.ch

● Contact:
christian.guegi@sentric.ch
http://www.sentric.ch
@chrisgugi

Data Model
● A sparse, multi-dimensional, sorted map
● A table consists of rows, each with a row key
● Each row may have any number of columns
● Rows are sorted lexicographically based on row key
● Column = Column Family : Column Qualifier
– Cell: identified by {row key, column, timestamp}

[Bigtable: A Distributed Storage System for Structured Data]

● Region: contiguous set of sorted rows
● Region: unit of distribution and availability
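
The data model above can be pictured as nested maps with the row keys kept in sorted order. The following is a minimal sketch only, assuming string row keys (HBase itself compares raw bytes); the class name `SparseSortedTable`, its methods and the sample rows are made up for illustration and are not part of any HBase API.

```python
from bisect import bisect_left, insort


class SparseSortedTable:
    """Toy in-memory model of the data model; not how HBase stores data."""

    def __init__(self):
        self._rows = {}         # row key -> {column -> {timestamp -> value}}
        self._sorted_keys = []  # row keys kept in lexicographic order

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            self._rows[row] = {}
            insort(self._sorted_keys, row)
        # Sparse: a row only holds the columns that were actually written.
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row key, columns) for start_row <= row key < stop_row."""
        i = bisect_left(self._sorted_keys, start_row)
        while i < len(self._sorted_keys) and self._sorted_keys[i] < stop_row:
            key = self._sorted_keys[i]
            yield key, self._rows[key]
            i += 1


# Hypothetical sample data; column = "family:qualifier".
table = SparseSortedTable()
table.put("com.cnn.www", "content:html", 5, "<html>v1")
table.put("com.cnn.www", "content:html", 6, "<html>v2")
table.put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
table.put("com.apache.www", "content:html", 3, "<html>apache")

latest = table.get("com.cnn.www", "content:html")
scanned = [row for row, _ in table.scan("com.a", "com.d")]
```

A contiguous slice of the sorted row keys, as returned by `scan`, corresponds to what the slide calls a region.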

Physical Data Organization
[Figure: a Region with two column families, content and anchor. Each column family has its own Store, made up of a Memstore and HFiles (on HDFS). The HLog (WAL on HDFS) is shown next to the Memstores.]

● Column families are stored separately on disk
– Unit of access control; families can have different access patterns
● Writes are held (sorted) in memory until flush
● Sorted on disk in predictable order
– By row key, column key, descending timestamp
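
The on-disk order described above can be expressed as a sort key. This is a small sketch under simplifying assumptions: the `KeyValue` tuple here is a hypothetical stand-in (not the HBase class of the same name), keys are compared as strings rather than bytes, and the sample cells are invented.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    row: str
    column: str      # "family:qualifier"
    timestamp: int
    value: str


def storage_order(kv):
    # Ascending row key, ascending column key, newest timestamp first.
    return (kv.row, kv.column, -kv.timestamp)


# All cells belong to one column family, since families are stored separately.
cells = [
    KeyValue("row2", "content:html", 1, "x"),
    KeyValue("row1", "content:html", 3, "old"),
    KeyValue("row1", "content:title", 2, "y"),
    KeyValue("row1", "content:html", 7, "new"),
]
ordered = sorted(cells, key=storage_order)
```

Because the newest version of a cell sorts first, a reader that wants the latest value can stop at the first match for a given row and column.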

Flushes and Compaction
● Flushing/compaction per Region
– One thread (CompactSplitThread) per region server
● Minor compaction
– Merges two or more HFiles into one
● Major compaction
– Picks up all HFiles in the region, merges them and removes deleted key/values
● Regions are split when they grow too large
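
The difference between the two compaction types can be sketched as a merge of already sorted files. This is an illustration only, not HBase's implementation: cells are plain tuples `(row, column, timestamp, value)`, a `None` value stands in for a delete marker, and a marker is assumed to mask every version of the same row and column with an equal or older timestamp.

```python
import heapq

TOMBSTONE = None  # assumed stand-in for a delete marker


def _order(cell):
    row, column, timestamp, value = cell
    # Newest first; at equal timestamps a delete marker sorts before a put.
    return (row, column, -timestamp, value is not TOMBSTONE)


def minor_compact(*hfiles):
    """Merge sorted files into one sorted file, keeping delete markers."""
    return list(heapq.merge(*hfiles, key=_order))


def major_compact(*hfiles):
    """Merge all files and drop delete markers plus the cells they mask."""
    merged = []
    deleted_up_to = {}  # (row, column) -> newest delete timestamp seen
    for row, column, timestamp, value in heapq.merge(*hfiles, key=_order):
        if value is TOMBSTONE:
            deleted_up_to.setdefault((row, column), timestamp)
            continue
        if timestamp <= deleted_up_to.get((row, column), float("-inf")):
            continue  # masked by a delete marker
        merged.append((row, column, timestamp, value))
    return merged


# Two hypothetical flushed files, each already in storage order.
hfile_1 = [("r1", "cf:a", 5, "v5"), ("r1", "cf:a", 1, "v1"), ("r2", "cf:a", 2, "x")]
hfile_2 = [("r1", "cf:a", 3, TOMBSTONE), ("r3", "cf:a", 1, "z")]

after_minor = minor_compact(hfile_1, hfile_2)
after_major = major_compact(hfile_1, hfile_2)
```

In this sketch the minor compaction must keep the delete marker, because files outside the merge could still hold older versions that it masks; only a merge over all files can safely drop it.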

System Architecture

[Figure: HBase system architecture, showing the HBase API, Master, RegionServer (HFile, Memstore, Write-Ahead Log), HDFS and ZooKeeper]

[HBase: The Definitive Guide]

Key Design & Distribution
● Bad idea: sequential row keys (continuous number or timestamp)
– Causes RegionServer hot-spotting
● Better: use a hash function and/or a composite key
– Distribute keys over random regions
– Uniform reads/writes across key space
● Proper key design is essential
– E.g. reversed URL (Bigtable paper)
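
Both key design ideas above can be sketched in a few lines. The helper names `salted_key` and `reversed_host_key`, the bucket count of 16 and the `NN-` prefix format are assumptions chosen for illustration; they are not an HBase API or a recommended setting.

```python
import hashlib
from urllib.parse import urlsplit


def salted_key(row_key, buckets=16):
    """Prefix a key with a hash-derived bucket so sequential keys spread out."""
    digest = hashlib.sha256(row_key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return "{:02d}-{}".format(bucket, row_key)


def reversed_host_key(url):
    """Reverse the host name so pages of one domain sort next to each other."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


# Sequential (timestamp-like) keys end up under many different prefixes.
prefixes = {salted_key(str(1352280000 + i))[:2] for i in range(1000)}
page_key = reversed_host_key("http://www.cnn.com/index.html")
```

The two approaches trade off differently: a hash prefix evens out writes but a range scan over the original key order then has to query every bucket, while a reversed URL keeps related rows adjacent and scannable.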

Overview
HBase Utilities

Useful Tools
● hbck – checks and fixes table integrity and region consistency
● HFile – examine contents of an HFile
● HLog – examine contents of an HLog file
● OfflineMetaRepair – rebuild meta table from file system
● HBase web interfaces
– Master
– RegionServer

Monitoring Tools
● Ganglia
● Nagios
● OpenTSDB
● …

All tools use metrics provided through JMX

Manual Splitting
● Via master web interface
– Split
● HBase shell split command
● RegionSplitter
– Create table with pre-split regions
– Rolling split of all regions on existing table
– ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
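
Pre-splitting needs a list of region boundaries. As a rough sketch, assuming row keys are fixed-width lowercase hex strings (for example hash-prefixed keys), evenly spaced boundaries could be computed as below. The function `hex_split_points` is hypothetical and is not claimed to reproduce what RegionSplitter does internally.

```python
def hex_split_points(num_regions, width=8):
    """Return num_regions - 1 evenly spaced boundaries over a hex key space."""
    if num_regions < 2:
        return []
    key_space = 16 ** width
    return [
        format(key_space * i // num_regions, "0{}x".format(width))
        for i in range(1, num_regions)
    ]


four_regions = hex_split_points(4)
```

Boundaries like these only balance load if the real keys are spread uniformly over the hex space; with skewed keys the split points have to follow the data instead.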

Disable Automatic Splitting
● Automatic splitting is determined by hbase.hregion.max.filesize
● To disable it, set the value very high (max. 100 GB)
● OK, but:
– How do I monitor my region growth?
– Where do I split when I have irregular data growth?
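
One way to think about the second question is to split by data volume rather than by key count. The sketch below assumes you already have per-key (or per-key-range) size samples sorted by row key; how those are collected is left open, and `balanced_split_key` is a made-up helper, not an existing tool.

```python
def balanced_split_key(key_sizes):
    """Pick the split key that divides the total size most evenly.

    key_sizes: list of (row_key, size_in_bytes), sorted by row key.
    The returned key would become the first key of the second daughter region.
    """
    if len(key_sizes) < 2:
        return None
    half = sum(size for _, size in key_sizes) / 2
    best_key, best_gap = None, None
    bytes_before = 0
    for index, (key, size) in enumerate(key_sizes):
        if index > 0:
            gap = abs(bytes_before - half)
            if best_gap is None or gap < best_gap:
                best_key, best_gap = key, gap
        bytes_before += size
    return best_key


# Hypothetical samples: most of the data sits in the later keys.
samples = [("a", 10), ("b", 10), ("c", 30), ("d", 50), ("e", 20)]
split_at = balanced_split_key(samples)
```

With these samples the middle key by count would be "c", but splitting at "d" leaves 50 bytes on the left and 70 on the right, which is closer to an even division.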

HBase Split Visualisation
with Hannibal

Hannibal
● Open source project on GitHub
– https://github.com/sentric/hannibal
● Web based
● Implemented in Scala
● Compatible with HBase 0.90
● Support for versions > 0.92 to be added soon
● Check it out!

How well are regions balanced over the cluster?

How well are the regions split for the table?

How did the region evolve over time?

Future Plans
● HBase 0.92 client API changes allow querying the compaction state of regions through HBaseAdmin → differentiate major from minor compactions
● Add a tool to find the best region split key for irregular data growth
● Expose metrics through JMX

Challenges
& Lessons Learned

Challenges
● Everyone is still learning
● Some issues only appear at scale
– At scale, nothing works as advertised
● Production cluster configuration
– Hardware issues
– Tuning cluster configuration to our workloads
● HBase stability
● Monitoring health of HBase

Lessons Learned
● Schema & key design
– What’s queried together should be stored together
● Monitoring/Operational tooling is most important
● Forget “emergency actions”, it takes some time
● You need DevOps in production
● Steep learning curve: you need to know the whole ecosystem
– Hadoop, HDFS, MapReduce, ZooKeeper

Resources to get started
● https://github.com/sentric/hannibal
● http://hbase.apache.org/book.html
● https://github.com/jmhsieh/hbase-repair-scripts
● http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
● HBase: The Definitive Guide

Thank you!

Questions?
@chrisgugi
