Introduction to HBase

Byeongweon Moon / REDDUCK
byeongweon.moon@reddduck.com

HBase Key Points
 Clustered, commodity(-ish) hardware
 Mostly schema-less
 Dynamic distribution
 Spread writes out over the cluster

HBase
 Distributed database modeled on Bigtable
 Bigtable: A Distributed Storage System for Structured Data, by Chang et al.
 Runs on top of Hadoop Core
 Layers on HDFS for storage
 Native connections to MapReduce
 Distributed, High Availability, High Performance, Strong Consistency

HBase (cont.)
 Column-oriented store
 Wide table costs only the data stored
 NULLs in row are ‘free’
 Good compression: columns of similar type
 Column name is arbitrary
 Rows stored in sorted order
 Supports random reads and writes
 Goal of billions of rows X millions of columns
 Petabytes of data across thousands of servers

!HBase
 “NoSQL” Database
 No joins
 No sophisticated query engine
 No transactions (sort of)
 No column typing
 No SQL, no ODBC/JDBC, etc.
 Not a replacement for RDBMS
 Matching impedance: pick the store that fits the workload

Why HBase?
 Datasets are reaching Petabytes
 Traditional databases are expensive to scale and difficult to distribute
 Commodity hardware is cheap and powerful
 Need for both random access (which Hadoop alone does not offer) and batch processing

Tables
 Table is split into roughly equal-sized “regions”
 Each region is a contiguous range of keys
 Regions split as they grow, thus dynamically adjusting to your data set
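
The region behaviour described above can be sketched in a few lines. This is a toy model, not HBase code: the `Table` class, its method names, and the row-count threshold `max_rows_per_region` are all assumptions made for illustration (HBase decides splits by store size, not by row count).

```python
import bisect


class Table:
    """Toy table: sorted row keys partitioned into contiguous regions."""

    def __init__(self, max_rows_per_region=4):
        # Hypothetical split threshold, counted in rows for simplicity.
        self.max_rows = max_rows_per_region
        # Region i covers keys in [start_keys[i], start_keys[i + 1]).
        self.start_keys = [b""]
        # Each region keeps its row keys in sorted order.
        self.regions = [[]]

    def region_index(self, row_key):
        """Return the index of the region whose key range contains row_key."""
        return bisect.bisect_right(self.start_keys, row_key) - 1

    def put(self, row_key):
        """Insert a row key and split the region if it grows too large."""
        i = self.region_index(row_key)
        region = self.regions[i]
        pos = bisect.bisect_left(region, row_key)
        if pos < len(region) and region[pos] == row_key:
            return  # key already present
        region.insert(pos, row_key)
        if len(region) > self.max_rows:
            self._split(i)

    def _split(self, i):
        """Split region i at its middle key into two contiguous regions."""
        region = self.regions[i]
        mid = len(region) // 2
        left, right = region[:mid], region[mid:]
        self.regions[i] = left
        self.regions.insert(i + 1, right)
        self.start_keys.insert(i + 1, right[0])
```

Because each region is a contiguous key range, finding the region for a row is a binary search over the region start keys, and a split only touches the one region that grew.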

Table (cont.)
 Tables are sorted by row key
 Table schema defines column families
 Families consist of any number of columns
 Columns consist of any number of versions
 Everything except table name is byte[]

(Table, Row, Family:Column, Timestamp) -> Value
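
As a rough illustration of this mapping, the sketch below keeps every version of a cell and reads back the newest one at or before a given timestamp. The `put` and `get` helpers and the in-memory `cells` dictionary are hypothetical names chosen for this example, not the HBase client API.

```python
# (table, row, "family:column") -> {timestamp: value}
cells = {}


def put(table, row, column, timestamp, value):
    """Store one version of a cell under its full coordinates."""
    cells.setdefault((table, row, column), {})[timestamp] = value


def get(table, row, column, timestamp=None):
    """Return the newest value at or before timestamp (newest overall if None)."""
    versions = cells.get((table, row, column), {})
    candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
    if not candidates:
        return None
    return versions[max(candidates)]
```

Writing the same cell twice with different timestamps keeps both versions, so a read can ask for the current value or for the value as of an earlier time.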

Table (cont.)
 As a data structure

SortedMap(
  RowKey, List(
    SortedMap(
      Column, List(
        Value, Timestamp
      )
    )
  )
)
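
The nested structure above can be approximated in Python as follows. This is a simplified sketch under stated assumptions: Python has no built-in sorted map, so ordering is imposed at scan time with `sorted()`; the column-family level is folded into a `family:column` key; and `put` and `scan` are illustrative names rather than HBase calls.

```python
from collections import defaultdict

# row key -> column -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))


def put(row, column, value, timestamp):
    """Add one version of a cell, keeping versions ordered newest first."""
    versions = table[row][column]
    versions.append((timestamp, value))
    versions.sort(key=lambda tv: tv[0], reverse=True)


def scan(start_row=b"", stop_row=None):
    """Yield (row, column, newest value) in row order, then column order.

    start_row is inclusive, stop_row is exclusive.
    """
    for row in sorted(table):
        if row < start_row:
            continue
        if stop_row is not None and row >= stop_row:
            break
        for column in sorted(table[row]):
            yield row, column, table[row][column][0][1]
```

Since rows come back in key order, a range scan is just a walk over a contiguous slice of the sorted keys, which is why row key design matters for read patterns.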

HBase Open Source Stack

 ZooKeeper : Small Data Coordination Service
 HBase : Database Storage Engine
 HDFS : Distributed File system
 Hadoop : Asynchronous Map-Reduce Jobs

Server Architecture
 Similar to HDFS
 Master == Namenode
 Regionserver == Datanode
 Often run these alongside each other!
 Difference: HBase stores state in HDFS
 HDFS provides robust data storage across machines, insulating against failure
 Master and Regionserver fairly stateless and machine independent

Region Assignment
 Each region from every table is assigned to a Regionserver
 Master Duties:
 Responsible for assignment and handling regionserver problems (if any!)
 When machines fail, move regions
 When regions split, move regions to balance
 Could move regions to respond to load
 Can run multiple backup masters
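
The assignment duties above can be sketched as two small functions. The names `assign` and `handle_failure`, the round-robin initial placement, and the least-loaded reassignment policy are assumptions for illustration only; they are not how the HBase master is implemented.

```python
def assign(regions, servers):
    """Round-robin initial assignment: returns a region -> server mapping."""
    return {r: servers[i % len(servers)] for i, r in enumerate(regions)}


def handle_failure(assignment, failed, live_servers):
    """Move every region on the failed server to the least-loaded live server."""
    # Count how many regions each live server currently holds.
    load = {s: 0 for s in live_servers}
    for server in assignment.values():
        if server in load:
            load[server] += 1
    # Reassign orphaned regions one at a time, updating load as we go.
    for region, server in sorted(assignment.items()):
        if server == failed:
            target = min(live_servers, key=lambda s: load[s])
            assignment[region] = target
            load[target] += 1
    return assignment
```

Only the mapping changes when a server fails. The region data itself lives in HDFS, so nothing has to be copied off the failed machine.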

Master
 The master does NOT handle any write request (not a DB master!)
 The master does NOT handle location-finding requests
 It is not involved in the read/write path
 It generally does very little most of the time
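
One way to picture a read/write path that bypasses the master is a client that resolves region locations itself and caches them. In the sketch below, the `Client` class and the `catalog` list are hypothetical stand-ins for the region location metadata a real client looks up through ZooKeeper and the catalog tables; it is not the HBase client API.

```python
class Client:
    """Toy client that locates regions without asking the master."""

    def __init__(self, catalog):
        # catalog: list of (start_key, stop_key, server); stop_key None = open-ended.
        # Hypothetical stand-in for the cluster's region location metadata.
        self.catalog = catalog
        self.cache = []
        self.catalog_lookups = 0

    @staticmethod
    def _contains(entry, row_key):
        start, stop, _ = entry
        return start <= row_key and (stop is None or row_key < stop)

    def locate(self, row_key):
        """Return the server for row_key, consulting the catalog only on a cache miss."""
        for entry in self.cache:
            if self._contains(entry, row_key):
                return entry[2]
        self.catalog_lookups += 1
        for entry in self.catalog:
            if self._contains(entry, row_key):
                self.cache.append(entry)
                return entry[2]
        raise KeyError(row_key)
```

After the first lookup for a region, later requests for keys in the same range go straight to the cached server, so neither the master nor the catalog sits on the hot path.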

Distributed Coordination
 ZooKeeper is used to manage master election and server availability
 Set up as a cluster, provides distributed coordination primitives
 An excellent tool for building cluster management systems

HBase Architecture

http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

Write-ahead-Log

http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

HBase - Roadmap
 HBase 0.92.0
 Coprocessors
 Distributed Log Splitting
 Running Tasks in UI
 Performance Improvements
 HBase 0.94.0
 Security
 Secondary Indexes
 Search Integration
 HFile v2

Reference
 http://ofps.oreilly.com/titles/9781449396107/index.html
 http://hbase.apache.org/book.html#quickstart
 http://www.larsgeorge.com/2010/02/fosdem-2010-nosql-talk.html
