Introduction to HBase
Byeongweon Moon / REDDUCK
byeongweon.moon@reddduck.com
HBase Key Points
 Clustered, commodity(-ish) hardware
 Mostly schema-less
 Dynamic distribution
 Spread writes out over the cluster
HBase
 Distributed database modeled on Bigtable
   Bigtable : A Distributed Storage System for
    Structured Data by Chang et al.
 Runs on top of Hadoop Core
 Layers on HDFS for storage
 Native connections to MapReduce
 Distributed, High Availability, High
  Performance, Strong Consistency
HBase (cont.)
 Column-oriented store
    Wide table costs only the data stored
    NULLs in row are ‘free’
    Good compression: columns of similar type
    Column name is arbitrary
 Rows stored in sorted order
 Can random read and write
 Goal of billions of rows X millions of columns
    Petabytes of data across thousands of servers
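
To make "NULLs are free" and "rows stored in sorted order" concrete, here is a minimal Python sketch. It is a toy model, not HBase code: the class SparseSortedTable and its methods are hypothetical names chosen for illustration. It stores only the cells that exist and answers range scans by key order.

```python
import bisect

class SparseSortedTable:
    """Toy model: rows kept in sorted key order, only present cells stored."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> {column name: value}

    def put(self, row, column, value):
        if row not in self._rows:
            bisect.insort(self._keys, row)
            self._rows[row] = {}
        self._rows[row][column] = value

    def get(self, row, column):
        # An absent cell takes no storage; it is simply not there.
        return self._rows.get(row, {}).get(column)

    def scan(self, start, stop):
        """Yield (row, cells) for start <= row < stop, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]

    def stored_cells(self):
        return sum(len(cells) for cells in self._rows.values())


table = SparseSortedTable()
table.put(b"user3", b"info:name", b"carol")
table.put(b"user1", b"info:name", b"alice")
table.put(b"user1", b"info:email", b"alice@example.com")
table.put(b"user2", b"stats:logins", b"7")
```

Rows were inserted out of order, yet a scan from b"user1" up to (but not including) b"user3" returns user1 and then user2. Only the four cells that were written are stored, however many distinct column names the table has seen.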
Column Oriented Storage
!HBase
 “NoSQL” Database
    No joins
    No sophisticated query engine
    No transactions (sort of)
    No column typing
    No SQL, no ODBC/JDBC, etc.
 Not a replacement for RDBMS
 Matching Impedance
Why HBase?
 Datasets are reaching Petabytes
 Traditional databases are expensive to scale
  and difficult to distribute
 Commodity hardware is cheap and powerful
 Need for random access as well as batch
  processing (Hadoop alone offers only batch)
Tables
 Table is split into roughly equal sized
  “regions”
 Each region is a contiguous range of keys
 Regions split as they grow, thus dynamically
  adjusting to your data set
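
The region idea above can be sketched in a few lines of Python. This is an illustration only: RegionedTable, max_region_rows, and the split-at-the-middle-key policy are assumptions made for the example (real HBase splits on store size, not row count).

```python
import bisect

class RegionedTable:
    """Toy model of a table split into contiguous key-range regions."""

    def __init__(self, max_region_rows=4):
        self.max_region_rows = max_region_rows  # hypothetical split threshold
        self.start_keys = [b""]  # region i covers [start_keys[i], start_keys[i+1])
        self.regions = [[]]      # each region holds its sorted row keys

    def locate(self, row):
        """Return the index of the region whose key range contains row."""
        return bisect.bisect_right(self.start_keys, row) - 1

    def put(self, row):
        idx = self.locate(row)
        region = self.regions[idx]
        pos = bisect.bisect_left(region, row)
        if pos == len(region) or region[pos] != row:
            region.insert(pos, row)
        if len(region) > self.max_region_rows:
            self._split(idx)

    def _split(self, idx):
        region = self.regions[idx]
        mid = len(region) // 2
        # The middle key becomes the start key of the new upper region.
        self.start_keys.insert(idx + 1, region[mid])
        self.regions[idx:idx + 1] = [region[:mid], region[mid:]]


t = RegionedTable(max_region_rows=4)
for i in range(10):
    t.put(b"row%02d" % i)
```

Because every region is a contiguous key range, finding the region for a row is a binary search over the region start keys. Splits happen locally as a region grows, so the table adjusts to the data without any up-front partitioning.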
Table (cont.)
 Tables are sorted by Row
 Table schema defines column families
    Families consist of any number of columns
    Columns consist of any number of versions
    Everything except table name is byte[]


(Table, Row, Family:Column, Timestamp) -> Value
Table (cont.)
 As a data structure

  SortedMap(
        RowKey, List(
              SortedMap(
                    Column, List(
                          Value, Timestamp
                    )
              )
        )
  )
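
A rough Python rendering of this nested structure follows. VersionedTable and its methods are hypothetical names for illustration, and plain dicts sorted on read stand in for real sorted maps. The point is the shape: a row maps to columns, and each column holds a list of timestamped versions, newest first.

```python
import time

class VersionedTable:
    """Toy model: row -> family:column -> list of (timestamp, value) versions."""

    def __init__(self):
        self._rows = {}

    def put(self, row, column, value, timestamp=None):
        ts = timestamp if timestamp is not None else int(time.time() * 1000)
        versions = self._rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before timestamp (or newest overall)."""
        for ts, value in self._rows.get(row, {}).get(column, []):
            if timestamp is None or ts <= timestamp:
                return value
        return None

    def row(self, row):
        """Return the newest value of each column, in sorted column order."""
        cells = self._rows.get(row, {})
        return [(col, cells[col][0][1]) for col in sorted(cells)]


t = VersionedTable()
t.put(b"row1", b"info:name", b"alice", timestamp=100)
t.put(b"row1", b"info:name", b"alicia", timestamp=200)
t.put(b"row1", b"info:age", b"30", timestamp=150)
```

A read without a timestamp returns the newest version (b"alicia"), while a read with timestamp=150 returns the value that was current at that time (b"alice").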
HBase Open Source Stack

 ZooKeeper : Small Data Coordination Service
 HBase : Database Storage Engine
 HDFS : Distributed File system
 Hadoop : Asynchronous Map-Reduce Jobs
Server Architecture
 Similar to HDFS
    Master == Namenode
    Regionserver == Datanode
 Often run these alongside each other!
 Difference: HBase stores state in HDFS
 HDFS provides robust data storage across
  machines, insulating against failure
 Master and Regionserver fairly stateless and
  machine independent
Region Assignment
 Each region from every table is assigned to a
  Regionserver
 Master Duties:
   Responsible for assignment and handling
      regionserver problems (if any!)
     When machines fail, move regions
     When regions split, move regions to balance
     Could move regions to respond to load
     Can run multiple backup masters
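
The master duties listed above can be sketched as a small Python model. ToyMaster and its least-loaded placement policy are assumptions for illustration, not HBase's actual balancer. The sketch shows assignment and what happens when a region server fails.

```python
class ToyMaster:
    """Toy model of a master that assigns regions to region servers."""

    def __init__(self, servers):
        self.assignments = {s: set() for s in servers}  # server -> regions

    def _least_loaded(self):
        # Ties are broken by server name so the result is deterministic.
        return min(self.assignments,
                   key=lambda s: (len(self.assignments[s]), s))

    def assign(self, region):
        server = self._least_loaded()
        self.assignments[server].add(region)
        return server

    def server_failed(self, server):
        """Reassign every region of a dead server to the surviving ones."""
        orphaned = self.assignments.pop(server)
        for region in sorted(orphaned):
            self.assign(region)

    def locate(self, region):
        for server, regions in self.assignments.items():
            if region in regions:
                return server
        return None


master = ToyMaster(["rs1", "rs2", "rs3"])
for n in range(6):
    master.assign("region-%d" % n)
master.server_failed("rs2")
```

After rs2 fails, its two regions are spread over rs1 and rs3, which end up with three regions each. No data moves in this model, which mirrors the point that region data lives in HDFS rather than on the region server itself.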
Master
 The master does NOT
    Handle any write request (not a DB master!)
    Handle location finding requests
    Take part in the read/write path
 Generally does very little most of the time
Distributed Coordination
 ZooKeeper is used to manage master
  election and server availability
 Set up as a cluster, provides distributed
  coordination primitives
 An excellent tool for building cluster
  management systems
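
Master election with ephemeral nodes can be illustrated with an in-memory simulation. Nothing below is ZooKeeper's API: ToyCoordinator, elect, and the path "/toy/master" are hypothetical. The sketch assumes a simple scheme in which candidates race to create one ephemeral node and that node disappears when its owner's session dies.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service with ephemeral nodes."""

    def __init__(self):
        self._nodes = {}   # path -> owning session

    def create_ephemeral(self, path, session):
        """Create path owned by session; return False if it already exists."""
        if path in self._nodes:
            return False
        self._nodes[path] = session
        return True

    def owner(self, path):
        return self._nodes.get(path)

    def expire_session(self, session):
        """Drop every node owned by a dead session."""
        self._nodes = {p: s for p, s in self._nodes.items() if s != session}


def elect(coordinator, candidates, path="/toy/master"):
    """Each candidate tries to create the node; the first to succeed is active."""
    for candidate in candidates:
        coordinator.create_ephemeral(path, candidate)
    return coordinator.owner(path)


zk = ToyCoordinator()
masters = ["master-a", "master-b", "master-c"]
active = elect(zk, masters)

# The active master's session dies; the backups race for the node again.
zk.expire_session(active)
new_active = elect(zk, [m for m in masters if m != active])
```

The first candidate wins the initial race. Once its session expires, the node vanishes and a backup takes over without any operator action.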
HBase Architecture




http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
How data is actually stored
Write-Ahead Log




http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
HLog
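
The write-ahead idea can be sketched in plain Python: every edit is appended to a log and synced to disk before it is applied in memory, so the in-memory state can be rebuilt after a crash. ToyStore, the JSON line format, and the local log file are assumptions for illustration. HBase's own log is kept in HDFS and is considerably more involved.

```python
import json
import os
import tempfile

class ToyStore:
    """Toy key-value store that logs each edit before applying it in memory."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.memstore = {}
        self._replay()

    def _replay(self):
        # On startup, rebuild in-memory state from whatever the log holds.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "r", encoding="utf-8") as log:
            for line in log:
                edit = json.loads(line)
                self.memstore[edit["row"]] = edit["value"]

    def put(self, row, value):
        # 1. Append the edit to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"row": row, "value": value}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the in-memory store.
        self.memstore[row] = value


log_path = os.path.join(tempfile.mkdtemp(), "toy.wal")
store = ToyStore(log_path)
store.put("row1", "a")
store.put("row2", "b")
store.put("row1", "c")

# Simulate a crash: the in-memory state is lost, the log survives.
recovered = ToyStore(log_path)
```

Replaying the log in order reproduces the last written value for each row, so the recovered store matches the one that "crashed".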
Demo
HBase - Roadmap
 HBase 0.92.0
   Coprocessors
   Distributed Log Splitting
   Running Tasks in UI
   Performance Improvements
 HBase 0.94.0
   Security
   Secondary Indexes
   Search Integration
   HFile v2
Reference
 http://ofps.oreilly.com/titles/9781449396107/index.html
 http://hbase.apache.org/book.html#quickstart
 http://www.larsgeorge.com/2010/02/fosdem-2010-nosql-talk.html
