HBase

The Hadoop database, a distributed, scalable, big data store.

Agenda
NoSQL Recap

Research Background

Table Components

High Level Architecture

Data Modeling Patterns

Challenges

RDBMS vs NoSQL
Relational databases (PostgreSQL, MySQL…) have the flexibility to answer any question,
which means they are optimized for none.

Typically scaling writes needs a lot of engineering when using RDBMS

Most NoSQL databases take a different approach: they answer one or a few questions in an
optimized, scalable manner.

RDBMS vs NoSQL
Relational databases are ACID (Atomic, Consistent, Isolated, Durable), which is a great property, but it makes them hard to scale.

The leader accepts writes and the followers serve reads. Scaling writes means we need to shard the database, which is difficult and has drawbacks.
[Diagram: one leader with two followers]

Research Background & Use Cases
HBase is primarily based on two seminal papers in distributed systems:

• The Hadoop File System

• Bigtable: A Distributed Storage System for Structured Data

Use cases
• HBase is a good choice for consistent, linearly scalable reads and writes on
large tables with billions of rows by millions of columns, scalable to PBs of data
without sacrificing performance (millisecond latency).
• Real time messaging App

• Gmail like Email Service

• Timeseries Data

• Firehose Application

HBase Characteristics
• HBase is consistent

• HBase is a key-value store with the ability to do random reads and writes, scans, and
partial scans

• HBase data is sorted by row key

• Data is stored as byte arrays; it is the application's job to serialize and deserialize
them to types or objects.
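Because data is stored and sorted as raw bytes, the way an application encodes a value also decides where a row lands in the sort order. A minimal sketch using only the Python standard library; the helper names `encode_int` and `decode_int` are illustrative and not part of any HBase client:

```python
import struct

def encode_int(n: int) -> bytes:
    """Encode a non-negative integer as 8 big-endian bytes."""
    return struct.pack(">Q", n)

def decode_int(raw: bytes) -> int:
    """Decode the 8-byte big-endian form back to an integer."""
    return struct.unpack(">Q", raw)[0]

# Keys written as decimal text sort byte-wise, not numerically.
as_text = sorted(str(n).encode("utf-8") for n in (2, 10, 1))

# Fixed-width big-endian keys sort in numeric order.
as_binary = sorted(encode_int(n) for n in (2, 10, 1))

print(as_text)                             # [b'1', b'10', b'2']
print([decode_int(k) for k in as_binary])  # [1, 2, 10]
```

The fixed-width encoding is one possible design choice: it keeps the byte order and the numeric order aligned, which matters as soon as scans are involved.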

Meta Table
Has the mapping between keys and region servers.

Key, Table -> Region Server Address
A..B, table xyz -> 192.168.1..
C…D, table bx -> 192.168.20..

Region Server 192.168.1..
• Region 1: table with row key A
• Region 2: table with row key B

Region Server 192.168.20..
• Region 1: table with row key A
• Region 2: table with row key B
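The lookup that the meta table performs can be sketched as a search over sorted region start keys. This is a toy model under simplifying assumptions (a single table, regions identified only by their start key), not HBase's actual client code; the server names are placeholders rather than the addresses from the slide:

```python
from bisect import bisect_right

# Toy meta table: (region start key, region server), sorted by start key.
# The server names are hypothetical placeholders.
META = [
    ("A", "region-server-1"),
    ("C", "region-server-2"),
]
START_KEYS = [start for start, _ in META]

def locate(row_key: str) -> str:
    """Return the server whose region start key is the greatest one <= row_key."""
    idx = bisect_right(START_KEYS, row_key) - 1
    if idx < 0:
        raise KeyError(f"no region covers {row_key!r}")
    return META[idx][1]

print(locate("Apple"))  # region-server-1
print(locate("Cat"))    # region-server-2
```

Because the start keys are sorted, finding the right region is a binary search rather than a scan over all regions.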

Data Modeling
The most important aspect of HBase data modeling is how you define the row key.

Row keys must be random and evenly distributed so that all region servers share the
load evenly and hot partitions (hot spotting) are avoided.
Examples of bad row keys:
First name & last name ("John-Smith")

URL ("www.yahoo.com", "www.youtube.com")

Patterns to Avoid Hot Partitions
Hashing the key

www.yahoo.com => 1b03577ed104f16aadc00a639d33cb44

www.youtube.com => ab3201c6103205c14f6e56b11b2fcd46
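A minimal sketch of hashing a row key in Python. The slide does not say which hash function produced the values above; MD5 is an assumption here, chosen only because its hex digest has the same length (32 characters) as the examples:

```python
import hashlib

def hashed_row_key(key: str) -> str:
    """Return the MD5 hex digest of the key (32 hex characters)."""
    return hashlib.md5(key.encode("utf-8")).hexdigest()

for url in ("www.yahoo.com", "www.youtube.com"):
    print(url, "=>", hashed_row_key(url))
```

The trade-off is that a hashed key no longer sorts like the original, so range and prefix scans on the original key stop working; a single row can still be read by hashing the key again.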

Salting

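Adding a random number as a suffix of the key to spread the data for one key over several rows. Because rows sort by key, suffixed keys stay next to each other; placing the salt at the front of the key is what spreads rows across regions.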

www.yahoo.com-10240

www.yahoo.com-10213
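Salting in the style of the examples above could look like the following sketch. The function names and the bucket count are illustrative assumptions; the point is that a reader must know the salt range in order to find every row again:

```python
import random

def salted_key(key: str, buckets: int, rng: random.Random) -> str:
    """Append a random bucket number to the key, as in the examples above."""
    return f"{key}-{rng.randrange(buckets)}"

def all_salted_keys(key: str, buckets: int) -> list[str]:
    """Every key a reader has to fetch to see all data for `key`."""
    return [f"{key}-{b}" for b in range(buckets)]

rng = random.Random(0)
written = salted_key("www.yahoo.com", 4, rng)
print(written in all_salted_keys("www.yahoo.com", 4))  # True
```

Writes get cheaper to spread, but every read fans out to `buckets` keys, so the bucket count is a balance between write distribution and read cost.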

Partial Scans & Data Modeling
HBase supports scans on a partial row key or on regexes

Example:

Row_keys:

yahoo.com

youtube.com

fb.com
A scan query with y* will return both youtube.com and yahoo.com
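Partial scans are cheap because the row keys are kept sorted: all keys sharing a prefix sit next to each other. A small in-memory sketch of the idea in Python (this models the behavior only and does not call HBase):

```python
from bisect import bisect_left

def prefix_scan(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted list."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # sorted order: no later key can match
        result.append(key)
    return result

rows = sorted(["yahoo.com", "youtube.com", "fb.com"])
print(prefix_scan(rows, "y"))  # ['yahoo.com', 'youtube.com']
```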

Suppose we want to create a service like Gmail to store emails:

Data modeling approach 1:

user_uuid:email_uuid

example: 0db72126-59fb-4e70-85f9-c82fce62c1e5:f15fe2db-09cc-4456-bc9a-d87b2a81d1b2

Now if we want to see all emails for this user, we can do a partial scan with user_uuid:*

Scan: 0db72126-59fb-4e70-85f9-c82fce62c1e5:*

Data modeling approach 2: Adding Time into the key

user_uuid:year:month:email_uuid

example: 0db72126-59fb-4e70-85f9-c82fce62c1e5:2020:07:f15fe2db-09cc-4456-b…d87b2a81d1b2
Now we can also scan by time range if we want:

Scan: 0db72126-59fb-4e70-85f9-c82fce62c1e5:2020:*

Or 0db72126-59fb-4e70-85f9-c82fce62c1e5:2020:07*
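The two key layouts and their scan prefixes can be sketched with a pair of helpers. The function names and the `email-1` identifier are hypothetical; zero-padding the month is an assumption that keeps months in calendar order when keys are compared as text:

```python
def email_row_key(user_uuid: str, year: int, month: int, email_uuid: str) -> str:
    """Build a user_uuid:year:month:email_uuid row key."""
    return f"{user_uuid}:{year:04d}:{month:02d}:{email_uuid}"

def scan_prefix(user_uuid: str, year=None, month=None) -> str:
    """Build the prefix for all emails of a user, a year, or a month."""
    if year is None:
        if month is not None:
            raise ValueError("month requires year")
        return f"{user_uuid}:"
    if month is None:
        return f"{user_uuid}:{year:04d}:"
    return f"{user_uuid}:{year:04d}:{month:02d}:"

USER = "0db72126-59fb-4e70-85f9-c82fce62c1e5"
key = email_row_key(USER, 2020, 7, "email-1")  # "email-1" is a placeholder id

print(key.startswith(scan_prefix(USER)))           # True
print(key.startswith(scan_prefix(USER, 2020)))     # True
print(key.startswith(scan_prefix(USER, 2020, 7)))  # True
print(key.startswith(scan_prefix(USER, 2021)))     # False
```

Putting the most frequently filtered fields first in the key is what makes each of these prefixes a contiguous range of rows.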

Challenges
HBase is hard to operate and maintain; it has a lot of components.

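HBase is not friendly to a lot of delete operations: because it stores data in an LSM tree, a delete is written as a tombstone marker rather than removing the data in place. Tombstones accumulate until a major compaction removes them, and reads have to skip over them in the meantime.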
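Why deletes are costly in an LSM-style store can be seen in a toy model. The class below is a deliberately simplified sketch (plain dictionaries as segments, no sorting, no disk) and is not how HBase is implemented; it only shows that a delete adds data until compaction runs:

```python
TOMBSTONE = object()  # marker meaning "this key was deleted"

class ToyLSM:
    """List of segments; the newest segment is checked first on reads."""

    def __init__(self):
        self.segments = [{}]

    def put(self, key, value):
        self.segments[-1][key] = value

    def delete(self, key):
        # A delete is just another write: it records a tombstone.
        self.segments[-1][key] = TOMBSTONE

    def flush(self):
        # Freeze the current segment and start a new one.
        self.segments.append({})

    def get(self, key):
        for segment in reversed(self.segments):
            if key in segment:
                value = segment[key]
                return None if value is TOMBSTONE else value
        return None

    def compact(self):
        # Merge all segments (newest wins) and drop tombstones for good.
        merged = {}
        for segment in self.segments:
            merged.update(segment)
        self.segments = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

    def stored_entries(self):
        return sum(len(s) for s in self.segments)

db = ToyLSM()
db.put("row1", "a")
db.flush()
db.delete("row1")
print(db.get("row1"), db.stored_entries())  # None 2
db.compact()
print(db.get("row1"), db.stored_entries())  # None 0
```

Before compaction the deleted row occupies two entries (the old value and its tombstone), which is the overhead a delete-heavy workload keeps paying.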
