HBase

Big Data and Hadoop Training
HBASE

Classification: Restricted

Agenda
•HBase Introduction
•Row & Column storage
•Characteristics of a huge DB
•What is HBase?
•HBase Data-Model
•HBase vs RDBMS
•HBase architecture
•HBase in operation
•Loading Data into HBase
•HBase shell commands
•HBase operations through Java
•HBase operations through MR

What is HBase?
• Open source project built on top of Apache Hadoop
• NoSQL database
• Distributed, scalable store
• Column-family datastore

How do you choose between SQL and NoSQL?
• What does your data look like?
• Is your data model likely to change?
• Is your data growing exponentially?
• Will you be doing real-time analytics on operational data?

Inspiration for HBase
•Google’s Bigtable is the inspiration for HBase
•It is designed to run on a cluster of computers.
Characteristics of Bigtable:
•Data is ‘Sparse’
•Data is stored as a ‘Sorted Map’
•‘Distributed’
•‘Multi-dimensional’
•‘Consistent’

HBase vs RDBMS

HBase | RDBMS
Data that is accessed together is stored together | Data is normalized
Column-family oriented | Row-oriented (mostly)
Flexible schema, can add columns on the fly | Fixed schema
Good with sparse tables | Not optimized for sparse tables
No joins | Optimized for joins
Horizontal scalability | Hard to shard and scale
Good for structured and semi-structured data | Good for structured data
Row-based transactions | Distributed transactions

Row & Column Storage
• Column-oriented store – many queries, especially in analytical databases, need only some columns of a table, not all of its values
• Advantages of column-oriented storage:
  • Reduced I/O, since only the needed columns are read
  • Values within a column are similar across logical rows, so they are better suited for compression
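
The two advantages above can be illustrated with a small sketch. It is a toy model, not HBase code: the table contents, the function names, and the use of run-length encoding as the compression scheme are all assumptions made for illustration.

```python
# Toy table: three logical rows with three columns each (hypothetical data).
rows = [
    {"id": 1, "city": "Pune", "temp": 31},
    {"id": 2, "city": "Pune", "temp": 29},
    {"id": 3, "city": "Delhi", "temp": 35},
]

# Row-oriented layout: all values of one row sit next to each other.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented layout: all values of one column sit next to each other.
column_store = {name: [r[name] for r in rows] for name in rows[0]}


def average_temp_row_store(store):
    """Walks whole rows, so every value is read to reach the 'temp' field."""
    values_read = sum(len(row) for row in store)
    temps = [row[2] for row in store]
    return sum(temps) / len(temps), values_read


def average_temp_column_store(store):
    """Reads only the 'temp' column."""
    temps = store["temp"]
    return sum(temps) / len(temps), len(temps)


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded


row_avg, row_reads = average_temp_row_store(row_store)
col_avg, col_reads = average_temp_column_store(column_store)
city_runs = run_length_encode(column_store["city"])
```

Both functions return the same average, but the row layout reads nine values while the column layout reads three, and the repeated city values collapse into two runs.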

HBase Data Model

Component | Description
Table | Data is organized into tables; a table comprises rows
Row key | Data is stored in rows; rows are identified by row keys (the primary key); rows are sorted by this value
Column family | Columns are grouped into families
Column qualifier | Identifies the column within a family
Cell | Combination of row key, column family, column qualifier, and timestamp; contains the value
Version | Values within a cell are versioned by a version number (by default, the timestamp)
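
One way to picture the components in this table is as a nested, sorted map. The sketch below is a hypothetical in-memory model, not the HBase client API: the class `ToyTable`, its methods, and the sample data are assumptions. Real HBase row keys are byte arrays sorted lexicographically; plain Python strings stand in for them here.

```python
class ToyTable:
    """Toy model: row key -> (family, qualifier) -> {timestamp: value}."""

    def __init__(self, families):
        # Column families are declared up front, when the table is created.
        self.families = set(families)
        self.rows = {}

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        # Qualifiers need no declaration; a new one appears on first write.
        columns = self.rows.setdefault(row_key, {})
        versions = columns.setdefault((family, qualifier), {})
        versions[timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self.rows.get(row_key, {}).get((family, qualifier), {})
        if not versions:
            return None          # sparse: absent cells simply do not exist
        if timestamp is None:
            timestamp = max(versions)   # default to the newest version
        return versions.get(timestamp)

    def scan(self):
        # Rows come back sorted by row key.
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


table = ToyTable(["info", "stats"])
table.put("user2", "info", "name", "Bea", timestamp=1)
table.put("user1", "info", "name", "Al", timestamp=1)
table.put("user1", "info", "name", "Alan", timestamp=2)
table.put("user1", "stats", "logins", 7, timestamp=2)
```

Here `table.get("user1", "info", "name")` returns the newest version, while passing `timestamp=1` returns the older one. The row `user2` has no `stats:logins` cell at all, which is what "sparse" means in this model.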

HBase Data Model
• Regions – horizontal partitions of an HBase table
• A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive)
• Regions are the units that get distributed over an entire cluster
• Initially, a table comprises a single region, but as the region grows it eventually crosses a configurable size threshold, at which point it splits at a row boundary into two new regions of approximately equal size
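
The splitting behaviour described above can be sketched as follows. This is a simplified model under stated assumptions: the threshold is a row count (`MAX_ROWS_PER_REGION`, a made-up name) rather than a size in bytes, the split point is the middle row key, and the `Region` class and helper functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    table: str
    start_row: str   # inclusive; "" means "from the beginning of the table"
    end_row: str     # exclusive; "" means "to the end of the table"
    rows: dict = field(default_factory=dict)

    def contains(self, row_key):
        return self.start_row <= row_key and (
            self.end_row == "" or row_key < self.end_row
        )


MAX_ROWS_PER_REGION = 4  # hypothetical threshold, counted in rows


def split(region):
    """Split at a row boundary into two regions of roughly equal size."""
    keys = sorted(region.rows)
    mid_key = keys[len(keys) // 2]
    lower = Region(region.table, region.start_row, mid_key,
                   {k: v for k, v in region.rows.items() if k < mid_key})
    upper = Region(region.table, mid_key, region.end_row,
                   {k: v for k, v in region.rows.items() if k >= mid_key})
    return lower, upper


def put(regions, row_key, value):
    """Write into the region that owns row_key; split it if it grows too large."""
    for i, region in enumerate(regions):
        if region.contains(row_key):
            region.rows[row_key] = value
            if len(region.rows) > MAX_ROWS_PER_REGION:
                regions[i:i + 1] = split(region)
            return
    raise KeyError(row_key)


# A new table starts as a single region covering the whole key space.
regions = [Region("TestTable", "", "")]
for key in ["a", "b", "c", "d", "e", "f"]:
    put(regions, key, f"value-{key}")
```

After the fifth write the single region exceeds the threshold and splits at `c`, leaving one region for keys before `c` and one for `c` onward. The first region's end row equals the second region's start row, matching the inclusive/exclusive convention.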

HBase Architecture

• HBase Master – master node
• Regionservers – slave nodes
• HBase Master
  • bootstraps a fresh install
  • assigns regions to registered regionservers
  • recovers from regionserver failures
• Regionservers
  • carry zero or more regions
  • take client read/write requests
  • manage region splits and inform the master about the new daughter regions

• ZooKeeper – authority on the cluster state
• Holds the location of the hbase:meta catalog table and the address of the current cluster master
• Assignment of regions is mediated via ZooKeeper in case servers crash mid-assignment
• The HBase client must know the location of the ZooKeeper ensemble
• Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations

• hbase:meta – the list, state, and locations of all regions on the cluster
• Entries in hbase:meta are keyed by region name
• Region name – the table name of the region, the region’s start row, its time of creation, and an MD5 hash of all of these
• e.g. TestTable,xyz,1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece.
• As row keys are sorted, finding the region that hosts a particular key is easy
• Whenever a region is split, enabled, disabled, deleted, etc., the catalog table is updated
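
Because regions are kept sorted by start row, locating the region for a key amounts to a binary search. The sketch below shows the idea with a hypothetical, much simplified catalog: the server names and start rows are invented, and the real hbase:meta entries carry more than a start row and a server.

```python
import bisect

# Hypothetical catalog for one table: (region start row, regionserver) pairs,
# sorted by start row. The first region starts at the empty key.
meta = [
    ("", "rs1.example.com"),
    ("g", "rs2.example.com"),
    ("p", "rs3.example.com"),
]
start_rows = [start for start, _ in meta]


def locate(row_key):
    """Return (start_row, server) for the region whose range contains row_key."""
    # bisect_right gives the position of the first start row greater than
    # row_key; the region just before that position hosts the key.
    index = bisect.bisect_right(start_rows, row_key) - 1
    return meta[index]
```

For example, `locate("apple")` falls in the first region, while `locate("g")` falls in the second because start rows are inclusive.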

HBase in Operation

• Fresh clients connect to the ZooKeeper cluster to get the location of hbase:meta, then query hbase:meta to find the user-space regions they need and where those regions are hosted
• Then, clients interact directly with regionservers
• Clients cache what they learn from these lookups, which works fine until there is a fault
• If a fault happens, clients contact hbase:meta again; if hbase:meta itself has moved, clients go back to ZooKeeper
• Writes arriving at a regionserver are first appended to a commit log and then added to an in-memory memstore. When a memstore fills, its content is flushed to the filesystem
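
The write path in the last bullet can be sketched as below. This is a toy model, not regionserver code: the class `ToyRegionServer`, its attribute names, and the memstore limit counted in entries (rather than bytes) are assumptions for illustration.

```python
class ToyRegionServer:
    """Hypothetical model of the write path: commit log, memstore, flush."""

    def __init__(self, memstore_limit=3):
        self.commit_log = []      # append-only record of every write
        self.memstore = {}        # in-memory buffer of recent writes
        self.flush_files = []     # immutable sorted files, oldest first
        self.memstore_limit = memstore_limit

    def put(self, row_key, value):
        self.commit_log.append((row_key, value))   # 1. log first, for recovery
        self.memstore[row_key] = value              # 2. then buffer in memory
        if len(self.memstore) >= self.memstore_limit:
            self.flush()                            # 3. flush when full

    def flush(self):
        # Each flush writes the memstore content out in sorted row-key order.
        self.flush_files.append(sorted(self.memstore.items()))
        self.memstore = {}


server = ToyRegionServer(memstore_limit=3)
for key in ["r3", "r1", "r2", "r4"]:
    server.put(key, key.upper())
```

After these four writes the commit log holds all four entries, the first three have been flushed as one sorted file, and only `r4` remains in the memstore.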

• When reading, the region’s memstore is consulted first. If sufficient versions are found by reading the memstore alone, the query completes there. Otherwise, flush files are consulted in order, from newest to oldest, until either enough versions are found to satisfy the query or the flush files run out.
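
The read order described above can be sketched as a short function. The data layout is an assumption: the memstore and each flush file are modelled as dictionaries mapping a row key to a list of values, newest first, and `read_versions` is a hypothetical name.

```python
def read_versions(memstore, flush_files, row_key, wanted=1):
    """Collect up to `wanted` versions of row_key, newest first.

    memstore:    dict mapping row key -> list of values, newest first
    flush_files: list of such dicts, ordered oldest to newest
    Returns the versions found and how many flush files were consulted.
    """
    found = list(memstore.get(row_key, []))[:wanted]
    files_read = 0
    for flush_file in reversed(flush_files):    # newest flush file first
        if len(found) >= wanted:
            break                               # enough versions, stop early
        files_read += 1
        found.extend(flush_file.get(row_key, []))
        found = found[:wanted]
    return found, files_read


memstore = {"r1": ["v4"]}
flush_files = [
    {"r1": ["v1"]},                      # oldest flush file
    {"r1": ["v3", "v2"], "r2": ["x1"]},  # newest flush file
]
```

Asking for one version of `r1` is answered from the memstore without touching any flush file. Asking for three versions reads only the newest flush file, and asking for more than exist reads every file and returns what is available.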

Loading Data into HBase

• Using HBase shell
• Using Client APIs
• Using Pig
• Using Sqoop

HBase Shell Commands

Connect to HBase from Clients

HBase Use Cases
• Capturing incremental data – time series data – high-volume, high-velocity writes
  • e.g. sensor readings, system metrics, events, stock prices, server logs, rainfall data
• Information exchange – high-volume, high-velocity writes and reads
  • e.g. email, chat
• Content serving, web application backend – high-volume, high-velocity reads
  • e.g. eBay, Groupon

Thank You
