Operating HBase –
Things You Need to Know
       Christian Gügi
Outline
●   HBase internals
●   Overview of HBase utilities
●   HBase split visualisation with Hannibal
●   Challenges & lessons learned
●   Resources to get started




About me
●   Software Architect @ Sentric
●   Founder and organizer of the Swiss Big
    Data User Group
    http://www.bigdata-usergroup.ch

●   Contact:
    christian.guegi@sentric.ch
    http://www.sentric.ch
    @chrisgugi

HBase Internals




Data Model
●   A sparse, multi-dimensional, sorted map
●   A table consists of rows; each row has a row key
●   Each row may have any number of columns
●   Rows are sorted lexicographically based on row key
●   Column = Column Family : Column Qualifier
    –   Cell: {rowkey, column, timestamp} → value




                                    [Bigtable: A Distributed Storage System for Structured Data]

●   Region: contiguous set of sorted rows
●   Region: unit of distribution and availability
Physical Data Organization
    [Diagram: a Region contains one Store per column family (here "content" and "anchor"). Each Store has a Memstore and one or more HFiles (on HDFS). The HLog is the WAL, also on HDFS.]

●   Column families are stored separately on disk
    –   Unit of access control; families can have different access patterns
●   Writes are held (sorted) in memory until flush
●   Sorted on disk in predictable order
    –   By row key, column key, descending timestamp
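
The on-disk ordering above can be illustrated with a small sketch. It is not HBase code: the KeyValue tuple and the storage_order function are hypothetical names, and the sample cells (one "anchor" column family, as in the Bigtable paper's example) are made up for illustration.

```python
from typing import NamedTuple


class KeyValue(NamedTuple):
    """Hypothetical stand-in for one stored cell version."""
    row: bytes
    column: bytes       # "family:qualifier"
    timestamp: int
    value: bytes


def storage_order(kv: KeyValue):
    # Ascending row key, ascending column key, descending timestamp,
    # so the newest version of a cell is encountered first.
    return (kv.row, kv.column, -kv.timestamp)


cells = [
    KeyValue(b"com.cnn.www", b"anchor:my.look.ca", 8, b"CNN.com"),
    KeyValue(b"com.cnn.www", b"anchor:cnnsi.com", 9, b"CNN"),
    KeyValue(b"com.apache.www", b"anchor:cnnsi.com", 4, b"ASF"),
    KeyValue(b"com.cnn.www", b"anchor:cnnsi.com", 12, b"CNN Sports"),
]

ordered = sorted(cells, key=storage_order)
for kv in ordered:
    print(kv.row, kv.column, kv.timestamp)
```

Because versions of the same cell sit next to each other, newest first, a reader that only wants the latest value can stop at the first match.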
Flushes and Compaction
●   Flushing/compaction per region
    –   One thread (CompactSplitThread) per region server
●   Minor compaction
    –   Merges two or more HFiles into one
●   Major compaction
    –   Picks up all HFiles in the region, merges them and
        removes deleted key/values
●   Regions are split when they grow too large
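
The difference between the two compaction types can be sketched as a merge of already-sorted files. This is a simplified model, not HBase's implementation: HFiles are plain Python lists, the compact function is a hypothetical name, and a delete is modelled as a value of None that hides all older versions of the same cell.

```python
import heapq

TOMBSTONE = None  # assumption: a delete marker is modelled as a None value


def compact(hfiles, major=False):
    """Merge sorted lists of (row, column, timestamp, value) into one list.

    Each input list must already be sorted by row, column and descending
    timestamp. A minor compaction only merges; a major compaction also
    drops delete markers and the older versions they hide.
    """
    merged = heapq.merge(*hfiles, key=lambda kv: (kv[0], kv[1], -kv[2]))
    if not major:
        return list(merged)

    result = []
    deleted = set()  # cells for which a delete marker has been seen
    for row, column, ts, value in merged:
        cell = (row, column)
        if cell in deleted:
            continue            # older version hidden by a newer delete
        if value is TOMBSTONE:
            deleted.add(cell)   # drop the marker itself as well
            continue
        result.append((row, column, ts, value))
    return result


hfile_a = [("r1", "f:a", 5, "v5"), ("r2", "f:a", 1, "x")]
hfile_b = [("r1", "f:a", 9, "v9"), ("r1", "f:a", 7, TOMBSTONE)]

minor = compact([hfile_a, hfile_b])
major = compact([hfile_a, hfile_b], major=True)
print(len(minor), major)
```

In this model the minor compaction keeps all four entries, while the major compaction keeps only the put at timestamp 9 and the untouched row r2.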

System Architecture

    [Diagram; components shown: HBase API, Master, RegionServer (HFile, Memstore, Write-Ahead Log), HDFS, ZooKeeper]
    [HBase: The Definitive Guide]

Key Design & Distribution
●   Bad idea: continuous number or timestamp
    (sequential row keys)
    –   RegionServer hot-spotting
●   Better: use hash function and/or composite
    key
    –   Distribute keys over random regions
    –   Uniform reads/writes across key space
●   Proper key design is essential
    –   E.g. reversed URL (Bigtable paper)
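
Both ideas can be sketched in a few lines. The helper names salted_key and reversed_url_key, the bucket count of 16 and the use of MD5 are assumptions made for this illustration; they are not part of HBase.

```python
import hashlib
from urllib.parse import urlsplit


def salted_key(key: str, buckets: int = 16) -> str:
    """Prefix a sequential key with a hash-derived bucket number.

    Consecutive keys (e.g. timestamps) then land in different parts of
    the key space instead of all hitting the last region.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}-{key}"


def reversed_url_key(url: str) -> str:
    """Reverse the host name so pages of one domain sort together."""
    parts = urlsplit(url)
    host = ".".join(reversed(parts.hostname.split(".")))
    return host + parts.path


print(salted_key("1351036800"))
print(reversed_url_key("http://www.cnn.com/index.html"))
```

The salt spreads writes evenly, but a range scan over the original keys then has to query every bucket. The reversed URL keeps related rows adjacent, which suits scans per domain.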
Overview
HBase Utilities




Useful Tools
●   hbck – checks and fixes table integrity and
    region consistency
●   HFile – examine contents of HFile
●   HLog – examine contents of HLog file
●   OfflineMetaRepair – rebuild meta table
    from file system
●   HBase web interfaces
    –   Master
    –   RegionServer
Monitoring Tools
●   Ganglia
●   Nagios
●   OpenTSDB
●   …

    All tools use metrics provided through JMX




Manual Splitting
●   Via master web interface
    –   Split
●   HBase shell split command
●   RegionSplitter
    –   Create table with pre-split regions
    –   Rolling split of all regions on existing table
    –   ./bin/hbase org.apache.hadoop.hbase.util.RegionSplitter
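
To show what pre-splitting means, here is a minimal sketch that computes evenly spaced split keys. It assumes row keys are fixed-width hexadecimal strings (for example hashed keys); hex_split_points is a hypothetical helper and not RegionSplitter's own code.

```python
from typing import List


def hex_split_points(num_regions: int, width: int = 8) -> List[str]:
    """Return the num_regions - 1 keys that divide a hex key space evenly.

    The key space is assumed to be all hex strings of the given width,
    i.e. 00000000 .. ffffffff for width 8.
    """
    if num_regions < 2:
        return []
    key_space = 16 ** width
    return [
        format(i * key_space // num_regions, f"0{width}x")
        for i in range(1, num_regions)
    ]


print(hex_split_points(4))
```

For four regions this yields the boundaries 40000000, 80000000 and c0000000. Evenly spaced boundaries only balance the load if the keys themselves are uniformly distributed.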


Disable Automatic Splitting
●   Automatic splitting is governed by hbase.hregion.max.filesize
●   Set it to a large value (max. 100 GB) so that automatic splits
    effectively never trigger
●   OK, but:
    –   How do I monitor my region growth?
    –   Where do I split when I have irregular data
        growth?
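
One possible answer to the second question is to split where the stored bytes are halved instead of where the key range is halved. The sketch below assumes you already have per-key (or per-key-range) size samples from somewhere; balanced_split_key and the sample data are hypothetical.

```python
def balanced_split_key(samples):
    """Pick the split key that best balances stored bytes.

    samples: list of (row_key, bytes_stored), sorted by row key.
    A split key K puts all rows < K in the left region and all rows
    >= K in the right region. Returns None if no split is possible.
    """
    total = sum(size for _, size in samples)
    best_key, best_gap = None, None
    left = 0  # bytes that would end up in the left region
    for i, (key, size) in enumerate(samples):
        if i > 0:  # splitting at the first key leaves the left side empty
            gap = abs(total - 2 * left)  # |right - left|
            if best_gap is None or gap < best_gap:
                best_key, best_gap = key, gap
        left += size
    return best_key


# Skewed growth: most data sits at the end of the key range.
samples = [("a", 5), ("b", 5), ("c", 10), ("d", 40), ("e", 40)]
print(balanced_split_key(samples))
```

With this skewed sample the middle key by position would be "c", leaving 10 bytes on the left and 90 on the right, while the size-based choice is "e", giving a 60/40 split.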




HBase Split Visualisation
    with Hannibal




Hannibal
●   Open source project on GitHub
    – https://github.com/sentric/hannibal
●   Web based
●   Implemented in Scala
●   Compatible with HBase 0.90
●   Support for HBase 0.92 and later coming soon
●   Check it out!

How well are regions balanced
over the cluster?




How well are the regions split for
the table?




How did the region evolve over
time?




Future Plans
●   HBase 0.92 client API changes allow querying the
    compaction state of regions through HBaseAdmin
    → differentiate major from minor compactions
●   Add a tool to find the best region split key for
    irregular data growth
●   Expose metrics through JMX



Challenges
& Lessons Learned




Challenges
●   Everyone is still learning
●   Some issues only appear at scale
    –   At scale, nothing works as advertised
●   Production cluster configuration
    –   Hardware issues
    –   Tuning cluster configuration to our workloads
●   HBase stability
●   Monitoring health of HBase
Lessons Learned
●   Schema & key design
    –   What’s queried together should be stored together
●   Monitoring/Operational tooling is most important
●   Forget quick “emergency actions”; recovery takes time
●   You need DevOps in production
●   Steep learning curve: you need to know the whole
    ecosystem
    –   Hadoop, HDFS, MapReduce, ZooKeeper



Resources to get started
●   https://github.com/sentric/hannibal
●   http://hbase.apache.org/book.html
●   https://github.com/jmhsieh/hbase-repair-scripts
●   http://www.sentric.ch/blog/best-practice-why-monitoring-hbase-is-important
●   HBase: The Definitive Guide


Thank you!



       Questions?
             @chrisgugi





ApacheCon Europe 2012: Operating HBase - Things You Need to Know
